The ability to mimic human notions of semantic distance has widespread applications. Some
measures rely only on raw text (distributional measures) and some rely on knowledge sources
such as WordNet. Although extensive studies have been performed to compare WordNet-based
measures with human judgment, the use of distributional measures as proxies to estimate
semantic distance has received little attention. Even though they have traditionally performed
poorly when compared to WordNet-based measures, they lay claim to certain uniquely attractive
features, such as their applicability in resource-poor languages and their ability to mimic both
semantic similarity and semantic relatedness. Therefore, this paper presents a detailed study of
distributional measures. Particular attention is paid to fleshing out the strengths and limitations
of both WordNet-based and distributional measures, and to how distributional measures of distance
can be brought more in line with human notions of semantic distance. We conclude with a brief
discussion of recent work on hybrid measures.

Semantic distance is a measure of how close or distant the meanings of two units of
language are. The units of language may be words, phrases, sentences, paragraphs, or
documents. The nouns dance and choreography, for example, are closer in meaning than
the nouns clown and bridge, and so are said to be semantically closer. The semantic
distances between words (or more precisely, between concepts) can be used as fundamental
building blocks for measuring semantic distance between larger units of language. The
ability to mimic human judgments of semantic distance is useful in numerous natural
language tasks including machine translation, word sense disambiguation, thesaurus
creation, information retrieval, text summarization, and identifying discourse structure.
This paper describes the state of the art in corpus-based measures of semantic distance
between these fundamental units of language. It identifies the significant challenges that
existing approaches to semantic distance face and in the process fleshes out questions
that lead to a better understanding of why two concepts are considered semantically
close. The paper concludes with a discussion of new hybrid approaches that show the
potential to address these challenges.

Units of language, especially words, may have more than one possible meaning. 
However, their context may be used to determine the intended senses. For example, star 
can mean both CELESTIAL BODY and CELEBRITY; however, star in — Stars are powered by 
nuclear fusion — refers only to CELESTIAL BODY and is much closer to sun than to famous. 
Thus, semantic distance between words in context is in fact the distance between their 
underlying senses or lexical concepts. Therefore, in this paper, we take word senses 
to be a particular kind of concept. When we refer directly to a concept (written in
small capitals), it is with the understanding that it is the sense of one or more words,
as reflected in the name that we give the concept. We do not, in this paper, consider
concepts that are unlexicalized.

Humans consider two concepts to be semantically close if there is a sharing of some
meaning. Specifically, two lexical concepts are semantically close if there is a lexical
semantic relation between the concepts. Putting it differently, the reason why two concepts
are considered semantically close can be attributed to a lexical semantic relation
that binds them. According to Cruse (1986), a lexical semantic relation is a relation
between lexical units (a lexical unit being a surface form along with a sense). As he points
out, the number of semantic relations that bind concepts is innumerable; but certain
relations, such as hyponymy, meronymy, antonymy, and troponymy, are more systematic and
have enjoyed more attention in the linguistics community. However, as Morris and Hirst (2004)
point out, these relations are far out-numbered by others, which they call non-classical
relations. Here are a few of the kinds of non-classical relations they observed: positive
qualities (BRILLIANT, KIND), concepts pertaining to a concept (KIND, CHIVALROUS,
FORMAL pertaining to GENTLEMANLY), and commonly co-occurring words (locations
such as HOMELESS, SHELTER; problem-solution pairs such as HOMELESS, SHELTER).

1.1 Semantic relatedness and semantic similarity 

Semantic distance is of two kinds: semantic similarity and semantic relatedness. The 
former is a subset of the latter, but the two terms may be used interchangeably in 
certain contexts, making it even more important to be aware of their distinction. Two 
concepts are considered to be semantically similar if there is a synonymy (or near- 
synonymy), hyponymy (hypernymy), antonymy, or troponymy relation between them 
(examples include APPLES-BANANAS, DOCTOR-SURGEON, DARK-BRIGHT). Two word 
senses are considered to be semantically related if there is any lexical semantic relation 
at all between them — classical or non-classical (examples include APPLES-BANANAS, 
SURGEON-SCALPEL, TREE-SHADE). 

1.2 Human judgments of semantic distance 

Humans are adept at estimating semantic distance; but consider the following questions:
How strongly will two people agree/disagree on distance estimates? Will the
agreement vary over different sets of concepts? Are we equally good at estimating
semantic similarity and semantic relatedness? In our minds, is there a clear distinction
between related and unrelated concepts or are concept-pairs spread across the whole
range from synonymous to unrelated? Some of the earliest work that begins to address
these questions is by Rubenstein and Goodenough (1965). They conducted quantitative
experiments with human subjects (51 in all) who were asked to rate 65 English
word pairs on a scale from 0.0 to 4.0 as per their semantic distance. The word pairs
chosen ranged from almost synonymous to unrelated. However, they were all noun
pairs and those that were semantically close were also semantically similar; the dataset
did not contain word pairs that are semantically related but not semantically similar.
The subjects repeated the annotation after two weeks and the new distance values
had a Pearson's correlation r of 0.85 with the old ones. Miller and Charles (1991) also
conducted a similar study on 30 word pairs taken from the Rubenstein-Goodenough
pairs. These annotations had a high correlation (r = 0.97) with the mean annotations
of Rubenstein and Goodenough (1965). Resnik (1999) repeated these experiments and
found the inter-annotator correlation (r) to be 0.90.
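
The correlation figures quoted above are Pearson's r computed between two lists of ratings for the same word pairs. A minimal sketch of that computation follows; the function name `pearson_r` and the five ratings per annotator are hypothetical and are not taken from any of the cited datasets.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length rating lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two ratings")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical ratings on the 0.0-4.0 scale from two annotators for five word pairs.
annotator_a = [3.9, 3.1, 1.2, 0.4, 2.5]
annotator_b = [3.7, 3.4, 0.9, 0.2, 2.0]
r = pearson_r(annotator_a, annotator_b)
```

The same function can be applied to a measure's scores and the mean human ratings when a dataset is used as a gold standard.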

Table 1
Different datasets that are manually annotated with distance values. Pearson's correlation
coefficient (r) was used to determine inter-annotator correlation (last column).

Dataset                      Year   Language   # pairs   PoS     # subjects   r
Rubenstein and Goodenough    1965   English    65        N       51           -
Miller and Charles           1991   English    30        N       38           .90
Resnik and Diab              2000   English    27        V       10           .76 and .79
Gurevych                     2005   German     65        N       24           .81
Zesch and Gurevych           2006   German     350       N,V,A   8            .69

Note: Rubenstein and Goodenough (1965) do not report inter-subject correlation, but
determine intra-subject correlation to be 0.85 for 36 (out of the 65) word pairs for which
similarity judgments were repeated by 15 (of the 51) subjects.

Resnik and Diab (2000) conducted annotations of 48 verb pairs and found
inter-annotator correlation (r) to be 0.76 (when the verbs were presented
without context) and 0.79 (when presented in context). Gurevych (2005) and
Zesch, Gurevych, and Mühlhäuser (2007) asked native German speakers to mark
two different sets of German word pairs with distance values. Set 1 was a German
translation of the Rubenstein and Goodenough (1965) dataset. It had 65 noun-noun
word pairs. Set 2 was a larger dataset containing 350 word pairs made up of nouns,
verbs, and adjectives. The semantically close word pairs in the 65-pair set were mostly
synonyms or hypernyms (hyponyms) of each other, whereas those in the 350-pair set
had both classical and non-classical relations with each other. Details of these semantic
distance benchmarks are summarized in Table 1. Inter-subject correlations (last column
in Table 1) are indicative of the degree of ease in annotating the datasets.

It should be noted here that even though the annotators were presented with word- 
pairs and not concept-pairs, it is reasonable to assume that they were annotated as per 
their closest senses. For example, given the noun pair bank and interest, most if not 
all will identify it as semantically related even though both words have more than 
one sense and many of the sense-sense combinations are unrelated (for example, the 
RIVER BANK sense of bank and the SPECIAL ATTENTION sense of interest). The high 
agreement and correlation values suggest that humans are quite good and consistent at 
estimating semantic distance of noun-pairs; however, annotating verbs and adjectives 
and a combination of parts of speech is harder. This also means that estimating semantic 
relatedness is harder than estimating semantic similarity. 

Apart from showing that humans can indeed estimate semantic distance, these 
datasets act as "gold standards" to evaluate automatic distance measures. However, 
lack of large amounts of data from human subject experimentation limits the reliability 
of this mode of evaluation. Therefore automatic distance measures are also evaluated 
by their usefulness in natural language tasks such as those mentioned earlier. 

1.3 Automatic measures of semantic distance 

Automatic measures of semantic distance quantify the semantic distance between word
pairs. They give values within a certain range (for example, 0 to 1), such that one end
of this range represents maximal closeness or synonymy, while the other end represents
maximal distance. Depending on which end is which, measures of semantic distance can
be classified as measures of distance (larger values indicate greater distance and less
closeness) and measures of closeness (larger values indicate shorter distance and more
closeness).¹ A measure of closeness can be easily converted to a measure of distance by
applying a suitable inverse function, or vice versa.
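
As a rough illustration of such a conversion, the sketch below shows two common choices of inverse function. The text does not prescribe any particular function, so both mappings and both function names are assumptions made only for illustration.

```python
def closeness_to_distance(closeness, max_closeness=1.0):
    """Map a closeness score in [0, max_closeness] to a distance in [0, 1].

    Assumed linear inversion: maximal closeness becomes distance 0.
    """
    if not 0.0 <= closeness <= max_closeness:
        raise ValueError("closeness must lie in [0, max_closeness]")
    return 1.0 - closeness / max_closeness


def distance_to_closeness(distance):
    """Map a non-negative, possibly unbounded distance to a closeness in (0, 1].

    Assumed reciprocal inversion: distance 0 becomes closeness 1.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)
```

Both mappings are monotonically decreasing, so they preserve the ranking of word pairs, which is what rank-based evaluations depend on.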

Two classes of automatic methods have been traditionally used to determine
semantic distance. Knowledge-rich measures of concept-distance, such as those of
Jiang and Conrath (1997), Leacock and Chodorow (1998), and Resnik (1995), rely on the
structure of a knowledge source, such as WordNet, to determine the distance between
two concepts defined in it.² Distributional measures of word-distance (knowledge-lean
measures), such as cosine and α-skew divergence (Lee 2001), rely on the distributional
hypothesis, which states that two words tend to be semantically close if they
occur in similar contexts (Firth 1957). Distributional measures rely simply on text (and
possibly some shallow syntactic processing) and can give the distance between any two
words that occur at least a few times.

The various WordNet-based measures have been widely studied
(Budanitsky and Hirst 2006; Patwardhan, Banerjee, and Pedersen 2003). The study
of distributional measures on the whole has received much less attention.³ Even
though, as Weeds (2003) and Mohammad and Hirst (2006b) show, they perform
poorly when compared to WordNet-based measures, the distributional measures of
word-distance have many attractive features, including their ability to measure both
semantic similarity and semantic relatedness. Further, they are not dependent on costly
knowledge sources that do not exist for most languages. This paper therefore focuses
on distributional measures and analyzes their strengths and limitations. Particular
attention is paid to the different kinds of distributional measures and their components.
The motivation is that a better understanding of distributional measures will lead
to bringing them more in line with human notions of semantic distance, while still
maintaining their applicability to resource-poor languages and their ability to mimic
both semantic similarity and semantic relatedness.

2. Knowledge-rich approaches to semantic distance 

Before we begin our examination of distributional measures, we look briefly at the 
resource-based measures. In some ways they are complementary to distributional 
measures and so the discussion will set the context for the analysis of distributional 
measures. 

Creation of electronically available ontologies and semantic networks such as
WordNet has allowed their use to help solve numerous natural language problems
including the measurement of semantic distance. Budanitsky and Hirst (2006),
Hirst and Budanitsky (2005), and Patwardhan, Banerjee, and Pedersen (2003) have
done an extensive survey of the various WordNet-based measures, their comparisons
with human judgment on selected word pairs, and their usefulness in applications such
as real-word spelling correction and word sense disambiguation. Hence, this section
provides only a brief summary of the major knowledge-rich measures of semantic
distance.

1 A note about terminology: In many contexts, the term distance measures refers to the complete set of
measures (irrespective of what the different ends of the range signify). In certain other contexts (as in this
paragraph), distance measures refers only to those measures that give larger values to signify greater
distance. The context, usually by its reference to this numeric property or lack thereof, will make clear the
intended meaning of the term.

2 The nodes in WordNet (synsets) represent word senses and edges between nodes represent semantic
relations such as hyponymy and meronymy.

3 See Curran (2004) and Weeds, Weir, and McCarthy (2004) for other work that compares various
distributional measures.

2.1 Measures that exploit WordNet's semantic network 

A number of WordNet-based measures consider two concepts to be close if they are
close to each other in WordNet. One of the earliest and simplest measures is Rada et al.'s
(1989) edge-counting method. The shortest path in the network between the two target
concepts (target path) is determined. The more edges there are between the two concepts,
the more distant they are. Elegant as it may be, the measure hinges on the largely incorrect
assumption that all the network edges correspond to identical semantic distance.
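
Edge counting amounts to a shortest-path search over the network. The sketch below uses breadth-first search on a small hand-made graph; the graph `TOY_GRAPH`, its node names, and the function `edge_count` are hypothetical and do not reproduce WordNet's actual structure.

```python
from collections import deque


def edge_count(graph, source, target):
    """Number of edges on the shortest path from source to target, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical fragment of an is-a network, stored as an undirected adjacency map.
TOY_EDGES = [
    ("entity", "vehicle"), ("vehicle", "car"), ("car", "sports car"),
    ("vehicle", "bicycle"), ("entity", "animal"), ("animal", "dog"),
]
TOY_GRAPH = {}
for a, b in TOY_EDGES:
    TOY_GRAPH.setdefault(a, set()).add(b)
    TOY_GRAPH.setdefault(b, set()).add(a)
```

Every edge contributes exactly one unit here, which is precisely the uniform-distance assumption criticized above.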

Nodes in a network may be connected by different kinds of lexical relations such
as hyponymy, meronymy, and so on. Edge counts apart, Hirst and St-Onge's (1998)
measure takes into account the fact that if the target path consists of edges that belong
to many different relations, then the target concepts are likely more distant. The idea is
that if we start from a particular node $c_1$ and take a path via a particular relation (say,
hyponymy), to a certain extent the concepts reached will be semantically related to $c_1$.
However, if along the way we take edges belonging to different relations (other than
hyponymy), very soon we may reach words that are unrelated. Hirst and St-Onge's
measure of semantic relatedness is:

\[ HS(c_1, c_2) = C - \text{path length} - k \times d \tag{1} \]

where $c_1$ and $c_2$ are the target concepts, d is the number of times an edge pertaining
to a relation different from that of the preceding edge is taken, and C and k are
empirically determined constants. More recently, Yang and Powers (2005) proposed a
weighted edge-counting method to determine semantic relatedness using the
hypernymy/hyponymy, holonymy/meronymy, and antonymy links in WordNet.
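
Equation (1) can be evaluated directly once the target path is known. In the sketch below a path is represented as the list of relation labels on its edges; the function names and the default values C = 8 and k = 1 are assumptions chosen for illustration, since the text states only that the constants are determined empirically.

```python
def count_relation_changes(path_relations):
    """Number of edges whose relation differs from that of the preceding edge (d)."""
    return sum(
        1 for prev, cur in zip(path_relations, path_relations[1:]) if prev != cur
    )


def hirst_st_onge(path_relations, C=8.0, k=1.0):
    """HS = C - path length - k * d, following equation (1).

    C and k are placeholder values, not constants reported in this paper.
    """
    path_length = len(path_relations)
    d = count_relation_changes(path_relations)
    return C - path_length - k * d


# Two hypothetical three-edge paths: one stays within hyponymy, one switches twice.
uniform_path = ["hyponymy", "hyponymy", "hyponymy"]
mixed_path = ["hyponymy", "meronymy", "hyponymy"]
```

With equal path lengths, the path that switches relations receives the lower relatedness score, as the measure intends.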

Leacock and Chodorow (1998) used just one relation (hyponymy) and modified the
path length formula to reflect the fact that edges lower down in the hyponymy hierarchy
correspond to smaller semantic distance than the ones higher up. For example, synsets
pertaining to sports car and car (low in the hierarchy) are much more similar than those
pertaining to transport and instrumentation (higher up in the hierarchy) even though both
pairs of nodes are separated by exactly one edge in the hierarchy. Their formula is:

\[ LC(c_1, c_2) = -\log \frac{len(c_1, c_2)}{2D} \tag{2} \]

where $len(c_1, c_2)$ is the length of the shortest path between the two concepts and
D is the maximum depth of the taxonomy.
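
A minimal sketch of the Leacock and Chodorow score, assuming the standard form of the formula (negative log of the path length divided by twice the maximum depth). The function name and the depth value used in the example are hypothetical.

```python
from math import log


def leacock_chodorow(path_length, max_depth):
    """-log(path_length / (2 * max_depth)); larger values mean greater similarity.

    path_length must be positive, so implementations typically count nodes
    rather than edges to avoid a zero length for identical concepts.
    """
    if path_length <= 0 or max_depth <= 0:
        raise ValueError("path_length and max_depth must be positive")
    return -log(path_length / (2.0 * max_depth))


# Hypothetical taxonomy depth; shorter paths yield higher similarity.
close_pair = leacock_chodorow(1, 16)
distant_pair = leacock_chodorow(4, 16)
```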

Resnik (1995) suggested a measure that combines corpus statistics with WordNet.
He proposed that since the lowest common subsumer or lowest superordinate (lso)
of the target nodes represents what is similar between them, the semantic similarity
between the two concepts is directly proportional to how specific the lso is. The more
general the lso is, the larger the semantic distance between the target nodes. This
specificity is measured by the formula for information content (IC), the negative logarithm
of the probability of the lso:

\[ Res(c_1, c_2) = IC(lso(c_1, c_2)) = -\log p(lso(c_1, c_2)) \tag{3} \]

Observe that using information content has the effect of inherently scaling the semantic
similarity measure by the depth of the taxonomy. Usually, the lower the lowest
superordinate, the lower the probability of occurrence of the lso and the concepts subsumed
by it, and hence, the higher its information content.

As per Resnik's formula, given a particular lowest superordinate, the exact positions
of the target nodes below it in the hierarchy do not have any effect on the
semantic similarity. Intuitively, we would expect that word pairs closer to the lso are
more semantically similar than those that are distant. Jiang and Conrath (1997) and
Lin (1997) incorporate this notion into their measures, which are arithmetic variations of
the same terms. The Jiang and Conrath (1997) measure (JC) determines how dissimilar
each target concept is from the lso ($IC(c_1) - IC(lso(c_1, c_2))$ and
$IC(c_2) - IC(lso(c_1, c_2))$).
The final semantic distance between the two concepts is then taken to be the sum of
these differences. Lin (1997) (like Resnik) points out that the lso is what is common
between the two target concepts and that its information content is the common information
between the two concepts. His formula (Lin) can be thought of as taking the Dice
coefficient of the information in the two target concepts.

\[ JC(c_1, c_2) = 2 \log p(lso(c_1, c_2)) - (\log p(c_1) + \log p(c_2)) \tag{4} \]

\[ Lin(c_1, c_2) = \frac{2 \log p(lso(c_1, c_2))}{\log p(c_1) + \log p(c_2)} \tag{5} \]
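
The three information-content measures differ only in how they combine the same log probabilities. The sketch below evaluates equations (3) to (5) for given concept probabilities; the function names and the probability values are hypothetical, and in practice the probabilities would be estimated from corpus counts propagated up the hierarchy.

```python
from math import log


def information_content(p):
    """IC = -log p for a concept probability p in (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return -log(p)


def resnik(p_lso):
    """Equation (3): similarity is the information content of the lso."""
    return information_content(p_lso)


def jiang_conrath(p_c1, p_c2, p_lso):
    """Equation (4): a distance; 0 when both concepts coincide with their lso."""
    return 2 * log(p_lso) - (log(p_c1) + log(p_c2))


def lin(p_c1, p_c2, p_lso):
    """Equation (5): a closeness score; undefined if both probabilities equal 1."""
    return 2 * log(p_lso) / (log(p_c1) + log(p_c2))


# Hypothetical probabilities: the lso is more frequent than either target concept.
P_C1, P_C2, P_LSO = 0.001, 0.002, 0.01
```

Note that the Jiang-Conrath value is a distance (larger means farther apart), whereas the Resnik and Lin values are closeness scores.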



Budanitsky and Hirst (2006) showed that the Jiang-Conrath measure
has the highest correlation (0.850) with the Miller and Charles noun pairs
and performs better than all these measures in a spelling correction task.
Patwardhan, Banerjee, and Pedersen (2003) achieved similar results using the measure
for word sense disambiguation.

All of the approaches described above rely heavily (if not solely) on
the hypernymy/hyponymy network in WordNet; they are designed for, and
evaluated on, noun-noun pairs. However, more recently, Resnik and Diab (2000)
and Yang and Powers (2006) developed measures aimed at verb-verb pairs.
Resnik and Diab (2000) ported several measures which are traditionally applied
on the noun hypernymy/hyponymy network (edge counting, and the measures of
Resnik (1995) and Lin (1997)) to the relatively shallow verb troponymy network.
The two information content-based measures ranked a carefully chosen set of 48
verbs best in order of their semantic distance.⁴ Yang and Powers (2006) ported their
earlier work on nouns (Yang and Powers 2005) to verbs. In order to compensate for
the relatively shallow verb troponymy hierarchy and the lack of a corresponding
holonymy/meronymy hierarchy, they proposed several back-off models, the most
useful one being the distance between a noun pair that has the same lexical form as
the verb pair. However, the approach has too many tuned parameters (9 in all) and
performed poorly on a set of 36 TOEFL word-choice questions involving verb targets
and alternatives.

4 Only those verbs were selected which require a theme, and the sub-categorization frames had to match.

2.2 Measures that rely on dictionaries and thesauri

Lesk (1986) introduced a method to perform word sense disambiguation using word
glosses (definitions). The glosses of the senses of a target word are compared with
those of its context and the number of word overlaps is determined. The sense with the
greatest number of overlaps is chosen as the intended sense of the target. Inspired by
this approach, Banerjee and Pedersen (2003) proposed a semantic relatedness measure
that deems two concepts to be more semantically related if there is more overlap in
their glosses. Notably, they overcome the problem of short glosses by considering the
glosses of concepts related to the target concepts through the WordNet lexical semantic
relations such as hyponymy/hypernymy. They also give more weight to larger overlap
sequences. Patwardhan and Pedersen (2006) proposed another gloss-based semantic
relatedness measure which performed slightly worse than the extended gloss overlap
measure in a word sense disambiguation task, but markedly better at ranking the
Miller and Charles (1991) word pairs. They create aggregate co-occurrence vectors for a
WordNet sense by adding the co-occurrence vectors of the words in its WordNet gloss.
The distance between two senses is then determined by the cosine of the angle between
their aggregate vectors. Such aggregate co-occurrence vectors are expected to be noisy
because they are created from data that is not sense-annotated.
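
The gloss-overlap idea can be sketched in a few lines. The version below counts shared content words between glosses and picks the sense with the most overlap; it is a simplification that ignores the extra weight Banerjee and Pedersen give to longer overlap sequences. The glosses, the sense labels, the stopword list, and the function names are all invented for illustration and are not WordNet entries.

```python
import re

# Small hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = frozenset(
    {"a", "an", "the", "of", "or", "and", "to", "in", "that", "is", "for", "with"}
)


def content_words(gloss):
    """Set of lower-cased alphabetic tokens in a gloss, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", gloss.lower()) if w not in STOPWORDS}


def gloss_overlap(gloss_a, gloss_b):
    """Number of distinct content words shared by two glosses."""
    return len(content_words(gloss_a) & content_words(gloss_b))


def lesk_pick_sense(sense_glosses, context_glosses):
    """Return the sense label whose gloss overlaps most with the context glosses."""
    def score(sense_id):
        return sum(gloss_overlap(sense_glosses[sense_id], g) for g in context_glosses)
    return max(sense_glosses, key=score)


# Invented glosses for two senses of "bank" and for one context word ("interest").
BANK_SENSES = {
    "bank/finance": "an institution that accepts deposits of money and lends money at interest",
    "bank/river": "sloping land beside a body of water such as a river",
}
CONTEXT_GLOSSES = ["a charge paid for borrowing money from a lender"]
```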

Jarmasz and Szpakowicz (2003) use the taxonomic structure of Roget's Thesaurus to
determine semantic similarity. Two words are considered maximally similar if they
occur in the same semicolon group in the thesaurus. From there on, decreasing in similarity,
are word pairs in the same paragraph, word pairs in different paragraphs belonging to
the same part of speech and within the same category, word pairs in the same category, and so
on until word pairs which have nothing in common except that they are in the thesaurus
(maximally distant). They show that this simple approach performs remarkably well at
ranking word pairs and determining the correct answer in sets of TOEFL, ESL, and
Reader's Digest word choice problems.

2.3 Challenges 

In this section, we review some of the shortcomings of resource-based measures in order
to motivate and to compare them with distributional measures that we will introduce
in Section 3.

2.3.1 Lack of high-quality WordNet-like knowledge sources. Ontologies, WordNets,
and semantic networks are available for a few languages such as English, German, and
Hindi. Creating them requires human experts and it is time intensive. Thus, for most
languages, we cannot use WordNet-based measures simply due to the lack of a WordNet
in that language. Further, even if created, updating an ontology is again expensive
and there is usually a lag between the current state of language usage/comprehension
and the semantic network representing it. Further, the complexity of human languages
makes creation of even a near-perfect semantic network of its concepts impossible. Thus
in many ways the ontology-based measures are only as good as the networks on which
they are based.

On the other hand, distributional measures require only text. Large corpora, billions 
of words in size, may now be collected by a simple web crawler. Large corpora of more- 
formal writing are also available (for example, the Wall Street Journal or the American 
Printing House for the Blind (APHB) corpus). This makes distributional measures very 
attractive. 

2.3.2 Poor estimation of semantic relatedness. As Morris and Hirst (2004) pointed out,
a large number of concept pairs, such as STRAWBERRY-CREAM and DOCTOR-SCALPEL,
have a non-classical relation between them (STRAWBERRIES are usually eaten with
CREAM and a DOCTOR uses a SCALPEL to make an incision). These words are not
semantically similar, but rather semantically related. An ontology- or WordNet-based
measure will correctly identify the amount of semantic relatedness only if such relations
are explicitly coded into the knowledge source. Further, the most accurate WordNet-based
measures rely only on its extensive is-a hierarchy. This is because networks of
other lexical relations such as meronymy are much less developed. Further, the networks
for different parts of speech are not well connected. All this means that, while
WordNet-based measures accurately estimate semantic similarity between nouns, their
estimation of semantic relatedness, especially in pairs other than noun-noun, is at best
poor and at worst non-existent. On the other hand, distributional measures can be used
to determine both semantic relatedness and semantic similarity.

2.3.3 Inability to cater to specific domains. Given a concept pair, measures that rely
only on WordNet and no text, such as that of Rada et al. (1989), give just one distance
value. However, two concepts may be very close in a certain domain but not so much in
another. For example, SPACE and TIME are close in the domain of quantum mechanics
but not so much in most others. Ontologies have been made for specific domains, which
may be used to determine semantic similarity specific to these domains. However, the
number of such ontologies is very limited. Some of the more successful WordNet-based
measures, such as that of Jiang and Conrath (1997), that rely also on text, do indeed
capture domain-specificity to some extent, but the distance values are still largely
shaped by the underlying network, which is not domain-specific. On the other hand,
distributional measures rely primarily (if not completely) on text, and large amounts of
corpora specific to particular domains can easily be collected.

2.3.4 Computational complexity and storage requirements. As applications for linguistic
distance become more sophisticated and demanding, it becomes attractive to
pre-compute and store the distance values between all possible pairs of words or
senses. However, both WordNet-based and distributional measures have large space
requirements to do this, requiring matrices of size N x N, where N is very large. In
the case of WordNet-based measures, N is the number of senses (81,000 just for nouns). In
the case of distributional measures, N is the size of the vocabulary (at least 100,000 for most
languages). Given that the above matrices tend to be sparse⁵ and that computational
capabilities are continuing to improve, the above limitation may not seem hugely
problematic, but as we see more and more natural language applications in embedded
systems and hand-held devices, such as cell phones, iPods, and medical equipment,
memory and computational power become serious constraints.
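
One way to exploit this sparseness is to store only the pairs whose closeness exceeds a threshold and treat every other pair as 0. The sketch below is a simple illustration of that idea, not a description of any particular system; the function names, the threshold, and the scores are hypothetical.

```python
def sparsify(pair_closeness, threshold):
    """Keep only pairs with closeness >= threshold, keyed independently of word order."""
    table = {}
    for (w1, w2), value in pair_closeness.items():
        if value >= threshold:
            table[frozenset((w1, w2))] = value
    return table


def lookup(table, w1, w2):
    """Closeness of a pair; pairs that were pruned or never stored read as 0.0."""
    return table.get(frozenset((w1, w2)), 0.0)


# Hypothetical closeness scores for three word pairs.
SCORES = {("doctor", "surgeon"): 0.82, ("doctor", "bridge"): 0.03, ("bug", "insect"): 0.77}
SPARSE = sparsify(SCORES, threshold=0.10)
```

Storage then grows with the number of retained pairs rather than with N x N, at the cost of losing all distinctions among pairs below the threshold.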

2.3.5 Reluctance to cross the language barrier. Both WordNet-based and distributional
measures have largely been used in a monolingual framework. Even though semantic
distance seems to hold promise in tasks such as machine translation and multi-lingual
text summarization that inherently involve two or more languages, automatic measures
of semantic distance have rarely been applied to them. With the development of
EuroWordNet, involving interconnected networks of seven different languages, it is possible
that we shall see more cross-lingual work using WordNet-based measures in the future.
However, such an interconnected network will be very hard to create for more-different
language pairs such as English and Chinese or English and Arabic.



5 Even though WordNet-based and distributional measures give non-zero closeness values to a large 
number of term pairs, values below a suitable threshold can be reset to 0. 
3. Knowledge-lean, distributional approaches to semantic distance

3.1 The distributional hypotheses: the original and the revised 

Distributional measures are inspired by the maxim "You shall know a word by the
company it keeps" (Firth 1957). These measures rely simply on raw text and possibly
some shallow syntactic processing. They are much less resource-hungry than the
semantic measures, but they measure the distance between words rather than word senses
or concepts. Two words are considered close if they occur in similar contexts.
The context of a target word is usually taken to be the set of words within a certain
window around it, for example, ±5 words or the complete sentence. The set of contexts
of a target word is usually represented by the set of words in these contexts, their
strength of association (SoA) with the target word, and possibly their syntactic relation
with the target, for example verb-object, subject-verb, and so on. The strength of
co-occurrence association between the target and another word quantifies how much more
(or less) than chance the two words occur together in text. Commonly used measures of
association are conditional probability (CP) and pointwise mutual information (PMI).
The distance between the sets of contexts of two target words can be used as a proxy
for their semantic distance, as words found in similar contexts tend to be semantically
similar; this is the distributional hypothesis (Firth 1957; Harris 1968).

The hypothesis makes intuitive sense, as Budanitsky and Hirst (2006) point out: If
two words have many co-occurring words in common, then similar things are being
said about both of them and so they are likely to be semantically similar. Conversely, if 
two words are semantically similar, then they are likely to be used in a similar fashion in 
text and thus end up with many common co-occurrences. For example, the semantically 
similar bug and insect are expected to have a number of common co-occurring words 
such as crawl, squash, small, woods, and so on, in a large-enough text corpus. 
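
The pipeline described in this section (collect co-occurring words within a window, weight them by strength of association, compare the resulting context vectors) can be sketched as follows. This is a minimal illustration that assumes PMI as the association measure and cosine as the vector comparison; the function names and the four-sentence toy corpus are hypothetical, and a real corpus would be far larger.

```python
from collections import Counter
from math import log, sqrt


def cooccurrence_counts(sentences, window=5):
    """For each target word, count the words within +/- window positions of it."""
    counts = {}
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            context = tokens[lo:i] + tokens[i + 1:i + 1 + window]
            counts.setdefault(target, Counter()).update(context)
    return counts


def pmi_vectors(counts):
    """Replace raw counts with PMI(t, c) = log(P(t, c) / (P(t) * P(c)))."""
    total = sum(sum(ctx.values()) for ctx in counts.values())
    target_totals = {t: sum(ctx.values()) for t, ctx in counts.items()}
    context_totals = Counter()
    for ctx in counts.values():
        context_totals.update(ctx)
    return {
        t: {c: log(n * total / (target_totals[t] * context_totals[c]))
            for c, n in ctx.items()}
        for t, ctx in counts.items()
    }


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, already tokenized.
CORPUS = [
    ["bug", "crawl", "small"],
    ["insect", "crawl", "small"],
    ["bank", "money", "loan"],
    ["bank", "loan", "interest"],
]
VECTORS = pmi_vectors(cooccurrence_counts(CORPUS))
```

In this toy corpus bug and insect share all of their co-occurring words, while bug and bank share none, so the cosine separates the two pairs cleanly; real data is much noisier.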

The distributional hypothesis only mentions semantic similarity and not semantic
relatedness. This, coupled with the fact that the difference between semantic relatedness
and semantic similarity is somewhat nuanced and can be missed, meant that
almost all work employing the distributional hypothesis was labeled as estimating
semantic similarity. However, it should be noted that distributional measures can
be used to estimate both semantic similarity and semantic relatedness. Even though
Schütze and Pedersen (1997) and Landauer, Foltz, and Laham (1998), for example, use
the term similarity and not relatedness, their LSA-based distance measures in fact
estimate semantic relatedness and not semantic similarity. We propose more-specific
distributional hypotheses that make clear how distributional measures can be used to
estimate semantic similarity and how they can be used to measure semantic relatedness:

Hypothesis of the distributionally close and semantically related:
Two target words are distributionally close and semantically related if they have many
common strongly co-occurring words.
(For example, doctor-surgeon and doctor-scalpel. See example co-occurring words in
Table 2.)

Hypothesis of the distributionally close and semantically similar:
Two target words are distributionally close and semantically similar if they have many
common strongly co-occurring words that each have the same syntactic relation with
the two targets.
(For example, doctor-surgeon, but not doctor-scalpel. See syntactic relations with example
co-occurring words in Table 2.)



Table 2

Example: Common syntactic relations of target words with co-occurring words.

                                      Co-occurring words
Target pair            Target         cut (v)                     hardworking (adj)   patient (n)

Semantically similar   doctor (n)     subject-verb                noun-qualifier      subject-object
                       surgeon (n)    subject-verb                noun-qualifier      subject-object

Semantically related   doctor (n)     subject-verb                noun-qualifier      subject-object
                       scalpel (n)    prepositional object-verb   -                   prepositional object-object



The idea is that both semantically similar and semantically related word pairs will 
have many common co-occurring words. However, words that are semantically similar 
belong to the same broad part of speech (noun, verb, etc.), but the same need not be 
true for words that are semantically related. Therefore, words that are semantically 
similar will tend to have the same syntactic relation, such as verb-object or subject- 
verb, with most common co-occurring words. Thus, the two words are considered 
semantically related simply if they have many common co-occurring words. But to be 
semantically similar as well, the words must have the same syntactic relation with co- 
occurring words. Consider the word pair doctor-operate. In a large enough body of text, 
the two words are likely to have the following common co-occurring words: patient, 
scalpel, surgery, recuperate, and so on. All these words will contribute to a high score of 
relatedness. However, they do not have the same syntactic relation with the two targets.
(The word doctor is almost always used as a noun while operate is a verb.) Thus, as per 
the two revised distributional hypotheses, doctor and operate will correctly be identified 
as semantically related but not semantically similar. The word pair doctor-nurse, on the
other hand, will be identified as both semantically related and semantically similar. 
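
The two revised hypotheses can be illustrated with a minimal sketch. It assumes a toy, hand-built context table (the `CONTEXTS` dictionary and both function names are hypothetical, with entries taken from Table 2); a real system would extract these pairs from parsed text and would also weigh the strength of each co-occurrence.

```python
from typing import Dict, Set, Tuple

# Hypothetical toy data: each target maps to (co-occurring word, syntactic relation) pairs.
CONTEXTS: Dict[str, Set[Tuple[str, str]]] = {
    "doctor": {("cut", "subject-verb"),
               ("hardworking", "noun-qualifier"),
               ("patient", "subject-object")},
    "surgeon": {("cut", "subject-verb"),
                ("hardworking", "noun-qualifier"),
                ("patient", "subject-object")},
    "scalpel": {("cut", "prepositional object-verb"),
                ("patient", "prepositional object-object")},
}


def shared_words(t1: str, t2: str) -> Set[str]:
    """Co-occurring words common to both targets, ignoring the syntactic relation
    (evidence for semantic relatedness)."""
    words1 = {word for word, _ in CONTEXTS[t1]}
    words2 = {word for word, _ in CONTEXTS[t2]}
    return words1 & words2


def shared_words_same_relation(t1: str, t2: str) -> Set[Tuple[str, str]]:
    """Co-occurring words that stand in the same syntactic relation to both targets
    (evidence for semantic similarity)."""
    return CONTEXTS[t1] & CONTEXTS[t2]


print(sorted(shared_words("doctor", "scalpel")))
print(sorted(shared_words_same_relation("doctor", "scalpel")))
print(sorted(shared_words_same_relation("doctor", "surgeon")))
```

In this toy data, doctor and scalpel share co-occurring words but none with the same relation, whereas doctor and surgeon share all three (word, relation) pairs.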

In order to clearly differentiate from the distance as calculated by a WordNet-based
semantic measure (described earlier), the distance calculated by a corpus-based
distributional measure will be referred to as distributional distance.



3.2 Corpus-based measures of distributional distance 

We now describe specific distributional measures that rely on the distributional hy- 
potheses; depending on which specific hypothesis they use, they mimic either semantic 
similarity or semantic relatedness. 

3.2.1 Spatial Metrics: Cos, L1, L2. Consider a multidimensional space in which the number
of dimensions is equal to the size of the vocabulary. A word w can be represented
by a point in this space such that the component of w in a dimension (corresponding to
word x, say) is equal to the strength of association (SoA) of w with x (SoA(w, x)). Thus,
the vectors corresponding to two words are close together, and thereby get a low
distributional distance score, if they share many co-occurring words and the co-occurring
words have more or less the same strength of association with the two target words. The
distance between two vectors can be calculated in different ways as described below.

Cosine. The cosine method (denoted by Cos) is one of the earliest and most widely used
distributional measures. Given two words w1 and w2, the cosine measure calculates the
cosine of the angle between the vectors of w1 and w2. If a large number of words co-occur
with both w1 and w2, then the two vectors will have a small angle between them and the
cosine will be large, signifying a large relatedness/similarity between them. The cosine
measure gives scores in the range from 0 (unrelated) to 1 (synonymous). So the higher the
value, the less distant the target word pair is.

\[
\mathrm{Cos}(w_1, w_2) = \frac{\sum_{w \in C(w_1) \cup C(w_2)} \left( P(w|w_1) \times P(w|w_2) \right)}{\sqrt{\sum_{w \in C(w_1)} P(w|w_1)^2} \times \sqrt{\sum_{w \in C(w_2)} P(w|w_2)^2}} \tag{6}
\]

where C(w) is the set of words that co-occur (within a certain window) with the word
w in a corpus. In this instantiation of the cosine measure, conditional probability of the
co-occurring words given the target words is used as the strength of association.
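
As an illustration of equation (6), the following is a minimal sketch that assumes each target word is represented as a dictionary mapping co-occurring words to conditional probabilities. The function name `cosine` and the toy profiles for bug and insect are hypothetical and not taken from any corpus.

```python
import math
from typing import Dict


def cosine(dp1: Dict[str, float], dp2: Dict[str, float]) -> float:
    """Cosine of two sparse vectors of conditional probabilities P(w | target)."""
    # Words that co-occur with only one target contribute 0 to the numerator,
    # so summing over the intersection equals summing over the union.
    shared = dp1.keys() & dp2.keys()
    numerator = sum(dp1[w] * dp2[w] for w in shared)
    norm1 = math.sqrt(sum(p * p for p in dp1.values()))
    norm2 = math.sqrt(sum(p * p for p in dp2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: an empty profile is treated as unrelated
    return numerator / (norm1 * norm2)


# Hypothetical conditional probabilities, for illustration only.
BUG = {"crawl": 0.5, "squash": 0.3, "small": 0.2}
INSECT = {"crawl": 0.4, "small": 0.4, "woods": 0.2}

print(round(cosine(BUG, INSECT), 3))
```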

The cosine was used, among others, by Schütze and Pedersen (1997) and
Yoshida, Yukawa, and Kuwabara (2003), who suggest methods of automatically
generating distributional thesauri from text corpora. Schütze and Pedersen (1997) use the
Tipster category B corpus (Harman 1993) (450,000 unique terms) and the Wall Street
Journal to create a large but sparse co-occurrence matrix of 3,000 medium-frequency
words (frequency rank between 2,000 and 5,000). Latent semantic indexing (singular
value decomposition) (Schütze and Pedersen 1997) is used to reduce the dimensionality
of the matrix and get for each term a word vector of its 20 strongest co-occurrences. The
cosine of a target word's vector with each of the other word vectors is calculated and
the words that give the highest scores comprise the thesaurus entry for the target word.



Yoshida, Yukawa, and Kuwabara (2003) believe that words that are closely related
for one person may be distant for another. They use around 40,000 HTML documents
to generate personalized thesauri for six different people. Documents used to
create the thesaurus for a person are retrieved from the subject's home page and a
web crawler which accesses linked documents. The authors also suggest a
root-mean-squared method to determine the similarity of two different thesaurus entries
for the same word.

Manhattan and Euclidean Distances. Distance between two points (words) in vector space
can also be calculated using the formulae for Manhattan distance, a.k.a. the L1 norm
(denoted by L1), or Euclidean distance, a.k.a. the L2 norm (denoted by L2). In the
Manhattan distance (7) (Dagan, Lee, and Pereira (1997), Dagan, Lee, and Pereira (1999),
and Lee (1999)), the difference in strength of association of w1 and w2 with each word
that they co-occur with is summed. The greater the difference, the greater is the
distributional distance between the two words. Euclidean distance (8) (Lee 1999) takes
the square root of the sum of the squared differences in association to get the final
distributional distance. Both the L1 and L2 norms give scores in the range between 0
(zero distance or synonymous) and infinity (maximally distant or unrelated).



\[
L_1(w_1, w_2) = \sum_{w \in C(w_1) \cup C(w_2)} \left| P(w|w_1) - P(w|w_2) \right| \tag{7}
\]

\[
L_2(w_1, w_2) = \sqrt{\sum_{w \in C(w_1) \cup C(w_2)} \left( P(w|w_1) - P(w|w_2) \right)^2} \tag{8}
\]

The above formulae use conditional probability of the co-occurring words given a target 
word as the strength of association. 
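
Equations (7) and (8) can be sketched in a few lines, again assuming that each target word is given as a dictionary of conditional probabilities. The names `l1_distance` and `l2_distance` and the toy values are hypothetical.

```python
import math
from typing import Dict


def l1_distance(dp1: Dict[str, float], dp2: Dict[str, float]) -> float:
    """Manhattan distance: sum of absolute differences over all co-occurring words."""
    words = dp1.keys() | dp2.keys()
    return sum(abs(dp1.get(w, 0.0) - dp2.get(w, 0.0)) for w in words)


def l2_distance(dp1: Dict[str, float], dp2: Dict[str, float]) -> float:
    """Euclidean distance: square root of the sum of squared differences."""
    words = dp1.keys() | dp2.keys()
    return math.sqrt(sum((dp1.get(w, 0.0) - dp2.get(w, 0.0)) ** 2 for w in words))


# Hypothetical conditional probabilities, for illustration only.
BUG = {"crawl": 0.5, "squash": 0.3, "small": 0.2}
INSECT = {"crawl": 0.4, "small": 0.4, "woods": 0.2}

print(round(l1_distance(BUG, INSECT), 3), round(l2_distance(BUG, INSECT), 3))
```

A word missing from one profile is treated as having probability 0 for that target, which is what summing over the union of the two co-occurrence sets implies.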

Lee (1999) compared the ability of all three spatial metrics to determine the probability
of an unseen (not found in training data) word pair. The measures in order of their
performance (from better to worse) were: L1 norm, cosine, and L2 norm. Weeds (2003)
determined the correlation of word pair ranking as per a handful of distributional
measures with human rankings (Miller and Charles (1991) word pairs). She used
verb-object pairs from the British National Corpus (BNC) and found the correlation of
L1 norm with human rankings to be 0.39.



3.2.2 Mutual information-based measures: Hindle, Lin. Hindle (1990) was one of the
first to factor the strength of association of co-occurring words into a distributional
similarity measure (see footnote 6). Consider the nouns n_j and n_k that exist as objects
of verb v_i in different instances within a text corpus. Hindle used the following formula
to determine the distributional similarity of n_j and n_k solely from their occurrences as
object of v_i:



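\[
\mathrm{Hin}_{obj}(v_i, n_j, n_k) =
\begin{cases}
\min(I(v_i, n_j), I(v_i, n_k)), & \text{if } I(v_i, n_j) > 0 \text{ and } I(v_i, n_k) > 0 \\
\left| \max(I(v_i, n_j), I(v_i, n_k)) \right|, & \text{if } I(v_i, n_j) < 0 \text{ and } I(v_i, n_k) < 0 \\
0, & \text{otherwise}
\end{cases} \tag{9}
\]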



I(v, n) stands for the pointwise mutual information (PMI) between the verb v and the
noun n (note that in case of negative PMI values, the maximum function captures the
PMI which is lower in absolute value). The measure follows from the distributional
hypothesis: the more similar the associations of co-occurring words with the two target
words, the more semantically similar they are. Hindle used PMI (see footnote 7) as the
strength of association. Using the minimum of the two PMIs captures the similarity in the
strength of association of v_i with each of the two nouns.

Hindle used an analogous formula to calculate distributional similarity (Hin_subj)
using the subject-verb relation. The overall distributional similarity between any two
nouns is calculated by the formula:

\[
\mathrm{Hin}(n_1, n_2) = \sum_{i=0}^{N} \left( \mathrm{Hin}_{obj}(v_i, n_1, n_2) + \mathrm{Hin}_{subj}(v_i, n_1, n_2) \right) \tag{10}
\]

The measure gives similarity scores from 0 (maximally dissimilar) to infinity (maximally
similar or synonymous). Note that in Hindle's measure, the set of co-occurring words
used is restricted to include only those words that have the same syntactic relation with
both target words (either verb-object or verb-subject). This is therefore a measure that



6 See Grefenstette (1992) for an approach that does not incorporate strength of association of co-occurring
words. He, like Hindle (1990), uses syntactic dependencies to characterize the set of contexts of a target
word. The Jaccard coefficient is used to determine how similar the two sets of contexts are.

7 Hindle (1990) and Lin (1998b) both refer to pointwise mutual information as mutual information.

mimics semantic similarity and not semantic relatedness. A form of Hindle's measure
where all co-occurring words are used, making it a measure that mimics semantic
relatedness, is shown below:

\[
\mathrm{Hin}_{rel}(w_1, w_2) = \sum_{w \in C(w_1) \cup C(w_2)}
\begin{cases}
\min(I(w, w_1), I(w, w_2)), & \text{if } I(w, w_1) > 0 \text{ and } I(w, w_2) > 0 \\
\left| \max(I(w, w_1), I(w, w_2)) \right|, & \text{if } I(w, w_1) < 0 \text{ and } I(w, w_2) < 0 \\
0, & \text{otherwise}
\end{cases} \tag{11}
\]

where C(w) is the set of words that co-occur with word w.
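
The relatedness form of Hindle's measure can be sketched as follows. The sketch assumes precomputed PMI values stored in dictionaries, and it assumes that a word that co-occurs with only one of the targets falls into the "otherwise" case and contributes 0. The names `hindle_term` and `hindle_relatedness` and the PMI values are hypothetical.

```python
from typing import Dict


def hindle_term(pmi1: float, pmi2: float) -> float:
    """Contribution of one co-occurring word, following the three-case formula."""
    if pmi1 > 0 and pmi2 > 0:
        return min(pmi1, pmi2)
    if pmi1 < 0 and pmi2 < 0:
        # The maximum of two negative values is the one lower in absolute value.
        return abs(max(pmi1, pmi2))
    return 0.0


def hindle_relatedness(pmi_w1: Dict[str, float], pmi_w2: Dict[str, float]) -> float:
    """Sum of per-word contributions over all words co-occurring with either target."""
    words = pmi_w1.keys() | pmi_w2.keys()
    # Assumption: a word missing from one profile is given PMI 0, so it adds nothing.
    return sum(hindle_term(pmi_w1.get(w, 0.0), pmi_w2.get(w, 0.0)) for w in words)


# Hypothetical PMI values, for illustration only.
DOCTOR = {"patient": 2.0, "scalpel": 1.5, "tax": -1.0}
OPERATE = {"patient": 1.0, "scalpel": 3.0, "tax": -0.5, "nurse": 2.0}

print(hindle_relatedness(DOCTOR, OPERATE))
```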



Lin (1998b) suggests a different measure derived from his information-theoretic
definition of similarity (Lin 1998a). Further, he uses a broad set of syntactic relations
apart from just subject-verb and verb-object relations and shows that using multiple
relations is beneficial even for Hindle's measure. He first extracts triples of the form
(x, r, y) from the partially parsed text, where the word x is related to y by the syntactic
relation r. Lin defines the distributional similarity between two words, w1 and w2, as
follows:

\[
\mathrm{Lin}(w_1, w_2) = \frac{\sum_{(r,w) \in T(w_1) \cap T(w_2)} \left( I(w_1, r, w) + I(w_2, r, w) \right)}{\sum_{(r,w') \in T(w_1)} I(w_1, r, w') + \sum_{(r,w'') \in T(w_2)} I(w_2, r, w'')} \tag{12}
\]

where T(x) is the set of all word pairs (r, y) such that the pointwise mutual information
I(x, r, y) is positive. Note that this is different from Hindle (1990), where even the
cases of negative PMI were considered. Church and Hanks (1990) showed that it is
hard to accurately predict negative word association ratios with confidence, and so
co-occurrence pairs with negative PMI are ignored. The measure gives similarity scores
from 0 (maximally dissimilar) to 1 (maximally similar).

Like Hindle's measure, Lin's is a measure of distributional similarity. However,
it distinguishes itself from that of Hindle in two respects. First, Lin normalizes the
similarity score between two words (numerator of (12)) by their cumulative strengths
of association with the rest of the co-occurring words (denominator of (12)). This is a
significant improvement as now high PMI of the target words with shared co-occurring
words alone does not guarantee a high distributional similarity score. As an additional
requirement, the target words must have low PMI with words they do not both
co-occur with. Second, Hindle uses the minimum of the PMI between each of the target
words and the shared co-occurring word, while Lin uses the sum. Taking the sum has
the drawback of not penalizing for a mismatch in strength of co-occurrence, as long as
w1 and w2 both co-occur with a word.
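
A minimal sketch of Lin's measure is given below. It assumes that the PMI of each (relation, word) feature has already been computed for both targets; the function name `lin_similarity`, the relation labels, and the PMI values are hypothetical.

```python
from typing import Dict, Tuple

Feature = Tuple[str, str]  # (syntactic relation, co-occurring word)


def lin_similarity(pmi_w1: Dict[Feature, float], pmi_w2: Dict[Feature, float]) -> float:
    """Shared positive-PMI mass, normalized by the total positive-PMI mass of both targets."""
    # T(x): keep only the features whose PMI with the target is positive.
    t1 = {f: v for f, v in pmi_w1.items() if v > 0}
    t2 = {f: v for f, v in pmi_w2.items() if v > 0}
    shared = t1.keys() & t2.keys()
    numerator = sum(t1[f] + t2[f] for f in shared)
    denominator = sum(t1.values()) + sum(t2.values())
    return numerator / denominator if denominator else 0.0


# Hypothetical PMI values, for illustration only.
DOCTOR = {("subject-verb", "treat"): 2.0,
          ("subject-verb", "operate"): 1.0,
          ("noun-qualifier", "hardworking"): 1.0}
SURGEON = {("subject-verb", "operate"): 3.0,
           ("noun-qualifier", "hardworking"): 1.0,
           ("subject-verb", "golf"): -0.5}

print(lin_similarity(DOCTOR, SURGEON))
```

Note how the feature with negative PMI is dropped before either sum is taken, and how the unshared feature (treat) lowers the score only through the denominator.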

Hindle (1990) used a portion of the Associated Press news stories (6 million words) to
classify the nouns into semantically related classes. Lin (1998b) used his measure to
generate a distributional thesaurus from a 64-million-word corpus of the Wall Street Journal,
San Jose Mercury, and AP Newswire. He also provides a framework for evaluating such
automatically generated thesauri by comparing them with WordNet-based and
Roget-based thesauri. He shows that the distributional thesaurus created with his measure
is closer to the WordNet and Roget-based thesauri than that created using Hindle's
measure.

3.2.3 Relative Entropy-Based Measures: KLD, ASD, JSD.

Kullback-Leibler divergence. Given two probability mass functions p(x) and q(x), their
relative entropy D(p||q) is:

\[
D(p \| q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} \quad \text{for } q(x) \neq 0 \tag{13}
\]

Intuitively, if p(x) is the accurate probability mass function corresponding to a random
variable X, then D(p||q) is the information lost when approximating p(x) by q(x). In
other words, D(p||q) is indicative of how different the two distributions are. Relative
entropy is also called the Kullback-Leibler divergence or the Kullback-Leibler
distance (denoted by KLD).

Pereira, Tishby, and Lee (1993) and Dagan, Lee, and Pereira (1994) point out that
words have probabilistic distributions with respect to neighboring syntactically related
words. For example, there exists a certain probabilistic distribution (d1(P(v|n1)), say) of
a particular noun n1 being the object of any verb. This distribution can be estimated
by corpus counts of parsed or chunked text. Let d2(P(v|n2)) be the corresponding
distribution for noun n2. These distributions (d1 and d2) define the contexts of the two
nouns (n1 and n2, respectively). As per the distributional hypothesis, the more these
contexts are similar, the more n1 and n2 are semantically similar. Thus the
Kullback-Leibler distance between the two distributions is indicative of the semantic
distance between the nouns n1 and n2.

\[
\begin{aligned}
\mathrm{KLD}(n_1, n_2) &= D(d_1 \| d_2) \\
&= \sum_{v \in V_b} P(v|n_1) \log \frac{P(v|n_1)}{P(v|n_2)} \quad \text{for } P(v|n_2) \neq 0 \\
&= \sum_{v \in V_b'(n_1) \cap V_b'(n_2)} P(v|n_1) \log \frac{P(v|n_1)}{P(v|n_2)} \quad \text{for } P(v|n_2) \neq 0
\end{aligned} \tag{14}
\]

where Vb is the set of all verbs and Vb'(x) is the set of verbs that have x as the object.
Note again that the set of co-occurring words used is restricted to include only verbs
that each have the same syntactic relation (verb-object) with both target nouns. This too
is therefore a measure that mimics semantic similarity and not semantic relatedness.

It should be noted that the verb-object relationship is not inherent to the measure 
and that one or more of any other syntactic relations may be used. One may also 
estimate semantic relatedness by using all words co-occurring with the target words. 
Thus a more generic expression of the Kullback-Leibler divergence is as follows: 



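\[
\begin{aligned}
\mathrm{KLD}(w_1, w_2) &= D(d_1 \| d_2) \\
&= \sum_{w \in V} P(w|w_1) \log \frac{P(w|w_1)}{P(w|w_2)} \quad \text{for } P(w|w_2) \neq 0 \\
&= \sum_{w \in C(w_1) \cup C(w_2)} P(w|w_1) \log \frac{P(w|w_1)}{P(w|w_2)} \quad \text{for } P(w|w_2) \neq 0
\end{aligned} \tag{15}
\]

where V is the vocabulary (all the words found in a corpus). C(w), as mentioned earlier,
is the set of words occurring (within a certain window) with word w.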

It should be noted that the Kullback-Leibler distance is not symmetric; that is, the
distance from w1 to w2 is not necessarily, and even not likely, the same as the distance
from w2 to w1. This asymmetry is counterintuitive to the general notion of semantic
similarity of words, although Weeds (2003) has argued in favor of asymmetric measures.
Further, it is very likely that there are instances such that P(v|w1) is greater than 0 for a
particular verb v, while due to data sparseness or grammatical and semantic constraints,
the training data has no sentence where v has the object w2. This makes P(v|w2) equal
to 0 and the ratio of the two probabilities infinite. Kullback-Leibler divergence is not
defined in such cases, but approximations may be made by considering smoothed
values for the denominator.
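
Both properties, the asymmetry and the zero-denominator problem, are easy to see in a small sketch. It assumes distributional profiles stored as dictionaries of conditional probabilities, and it makes the labeled assumption of returning infinity when the divergence is undefined rather than applying any smoothing. The function name `kld` and the toy values are hypothetical.

```python
import math
from typing import Dict


def kld(d1: Dict[str, float], d2: Dict[str, float]) -> float:
    """Kullback-Leibler divergence D(d1 || d2) over sparse distributions."""
    total = 0.0
    for word, p in d1.items():
        if p == 0.0:
            continue  # a zero-probability word contributes nothing
        q = d2.get(word, 0.0)
        if q == 0.0:
            # Undefined case: assumption of this sketch is to report infinity
            # instead of smoothing the denominator.
            return math.inf
        total += p * math.log(p / q)
    return total


# Hypothetical distributions, for illustration only.
D1 = {"eat": 0.5, "peel": 0.5}
D2 = {"eat": 0.9, "peel": 0.1}

print(round(kld(D1, D2), 3), round(kld(D2, D1), 3))  # the two directions differ
print(kld({"eat": 1.0}, {"drive": 1.0}))             # no shared support
```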

Pereira, Tishby, and Lee (1993) used KLD to create clusters of nouns from verb-object
pairs corresponding to the thousand most frequent nouns in the Grolier's Encyclopedia,
June 1991 version (10 million words). Dagan, Lee, and Pereira (1994) used KLD
to estimate the probabilities of bigrams that were not seen in a text corpus. They point
out that a significant number of possible bigrams are not seen in any given text corpus.
The probabilities of such bigrams may be determined by taking a weighted average
of the probabilities of bigrams composed of distributionally similar words. Use of
Kullback-Leibler distance as the semantic distance metric yielded a 20% improvement
in perplexity on the Wall Street Journal and dictation corpora provided by ARPA's HLT
program (Paul 1991).

It should be noted here that the use of distributionally similar words to estimate 
unseen bigram probabilities will likely lead to erroneous results in case of less-preferred 
and strongly-preferred collocations (word pairs). Inkpen and Hirst (2002) point out
that even though words like task and job are semantically very similar, the collocations 
they form with other words may have varying degrees of usage. While daunting task is 
a strongly-preferred collocation, daunting job is rarely used. Thus using the probability 
of one bigram to estimate that of another will not be beneficial in such cases. 



α-skew divergence. The α-skew divergence (ASD) is a slight modification of the
Kullback-Leibler divergence that obviates the need for smoothed probabilities. It has the
following formula:

\[
\mathrm{ASD}(w_1, w_2) = \sum_{w \in C(w_1) \cup C(w_2)} P(w|w_1) \log \frac{P(w|w_1)}{\alpha P(w|w_2) + (1 - \alpha) P(w|w_1)} \tag{16}
\]

where α is a parameter that may be varied but is usually set to 0.99. Note that the
denominator within the logarithm is never zero with a non-zero numerator. Also, the
measure retains the asymmetric nature of the Kullback-Leibler divergence. Lee (2001)
shows that α-skew divergence performs better than Kullback-Leibler divergence in
estimating word co-occurrence probabilities. Weeds (2003) achieves a correlation of
0.48 and 0.26 with human judgment on the Miller and Charles word pairs using
ASD(w1, w2) and ASD(w2, w1), respectively.
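
The effect of mixing a small amount of the first distribution into the second can be sketched as follows, assuming profiles stored as dictionaries of conditional probabilities. The function name `asd` and the toy distributions are hypothetical.

```python
import math
from typing import Dict


def asd(d1: Dict[str, float], d2: Dict[str, float], alpha: float = 0.99) -> float:
    """Alpha-skew divergence of d1 from d2; finite for any alpha strictly below 1."""
    total = 0.0
    for word, p in d1.items():
        if p == 0.0:
            continue  # a zero-probability word contributes nothing
        # The (1 - alpha) * p term keeps the denominator positive whenever p is positive.
        mixed = alpha * d2.get(word, 0.0) + (1.0 - alpha) * p
        total += p * math.log(p / mixed)
    return total


# Hypothetical distributions, for illustration only.
D1 = {"eat": 0.5, "peel": 0.5}
D2 = {"eat": 0.9, "peel": 0.1}

print(round(asd(D1, D2), 3), round(asd(D2, D1), 3))  # still asymmetric
print(round(asd({"eat": 1.0}, {"drive": 1.0}), 3))   # finite even with no shared support
```

With α = 0.99 and no shared support, the score equals log(1 / (1 − α)), which is the largest value the sketch can return for that α.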



Jensen-Shannon divergence. A relative entropy-based measure that overcomes the problem
of asymmetry in Kullback-Leibler divergence is the Jensen-Shannon divergence,
a.k.a. total divergence to the average, a.k.a. information radius. It is denoted by JSD
and has the following formula:

\[
\mathrm{JSD}(w_1, w_2) = D\left(d_1 \,\Big\|\, \tfrac{1}{2}(d_1 + d_2)\right) + D\left(d_2 \,\Big\|\, \tfrac{1}{2}(d_1 + d_2)\right) \tag{17}
\]

\[
= \sum_{w \in C(w_1) \cup C(w_2)} \left( P(w|w_1) \log \frac{P(w|w_1)}{\tfrac{1}{2}\left(P(w|w_1) + P(w|w_2)\right)} + P(w|w_2) \log \frac{P(w|w_2)}{\tfrac{1}{2}\left(P(w|w_1) + P(w|w_2)\right)} \right) \tag{18}
\]

The Jensen-Shannon divergence is the sum of the Kullback-Leibler divergence between
each of the individual co-occurrence distributions d1 and d2 of the target words with
the average distribution (½(d1 + d2)). Further, it can be shown that the Jensen-Shannon
divergence avoids the problem of zero denominator. The Jensen-Shannon divergence
is therefore always well defined and, like α-skew divergence, obviates the need for
smoothed estimates.
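
A minimal sketch of equation (18) follows, assuming profiles stored as dictionaries of conditional probabilities. The function name `jsd` and the toy distributions are hypothetical.

```python
import math
from typing import Dict


def jsd(d1: Dict[str, float], d2: Dict[str, float]) -> float:
    """Sum of the KL divergences of d1 and d2 from their average distribution."""
    total = 0.0
    for word in d1.keys() | d2.keys():
        p = d1.get(word, 0.0)
        q = d2.get(word, 0.0)
        average = 0.5 * (p + q)
        # The average is positive whenever p or q is positive, so no term is undefined.
        if p > 0.0:
            total += p * math.log(p / average)
        if q > 0.0:
            total += q * math.log(q / average)
    return total


# Hypothetical distributions, for illustration only.
D1 = {"eat": 0.5, "peel": 0.5}
D2 = {"eat": 0.9, "peel": 0.1}

print(round(jsd(D1, D2), 4), round(jsd(D2, D1), 4))  # same value in both directions
```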

The Kullback-Leibler divergence, α-skew divergence, and Jensen-Shannon divergence
all give distributional distance scores from 0 (synonymous) to infinity (unrelated).

3.2.4 Latent Semantic Analysis. Landauer, Foltz, and Laham (1998) proposed latent
semantic analysis (LSA), which can be used to determine distributional distance
between words or between sets of words (see footnote 8). Unlike the various approaches
described earlier where a word-word co-occurrence matrix is created, the first step of LSA
involves the creation of a word-paragraph, word-document, or similar such word-passage
matrix, where a passage is some grouping of words. A cell for word w and passage p is
populated with the number of times w occurs in p or, for even better results, a function of
this frequency that captures how much information the occurrence of the word in a text
passage carries.

Next, the dimensionality of this matrix is reduced by applying singular
value decomposition (SVD), a standard matrix decomposition technique. This
smaller set of dimensions represents abstract (unknown) concepts. Then the original
word-passage matrix is recreated, but this time from the reduced dimensions.
Landauer, Foltz, and Laham (1998) point out that this results in new matrix cell values
that are different from what they were before. More specifically, words that are
expected to occur more often in a passage than what the original cell values reflect are
incremented. Then a standard vector distance measure, such as cosine, that captures the
distance between distributions of the two target words is applied.
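
The steps above can be sketched with NumPy, assuming raw counts in the cells and a deliberately tiny, hypothetical word-passage matrix. The function names `lsa_reconstruct` and `cosine_rows` are illustrative only; this is not the setup used in any of the cited studies.

```python
import numpy as np


def lsa_reconstruct(word_passage: np.ndarray, k: int) -> np.ndarray:
    """Recreate the word-passage matrix from its k largest singular dimensions."""
    u, s, vt = np.linalg.svd(word_passage, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]


def cosine_rows(matrix: np.ndarray, i: int, j: int) -> float:
    """Cosine between the row vectors of two words."""
    a, b = matrix[i], matrix[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical counts. Rows: doctor, surgeon, patient. Columns: passage 1, passage 2.
# doctor and surgeon never share a passage, but both occur with patient.
COUNTS = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])

reduced = lsa_reconstruct(COUNTS, k=1)
print(cosine_rows(COUNTS, 0, 1))    # cosine of doctor and surgeon on the raw counts
print(cosine_rows(reduced, 0, 1))   # cosine after reducing to one dimension
print(np.round(reduced, 2))         # recreated cell values
```

In the recreated matrix the cell for doctor in passage 2 is no longer zero, which mirrors the observation that words expected to occur in a passage have their cell values incremented.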

LSA was used by Schütze and Pedersen (1997), Turney (2001), and Rapp (2003) to
measure distributional distance, with encouraging results. However, there is no
non-heuristic way to determine when the dimension reduction should stop. Further, the
generic concepts represented by the reduced dimensions are not interpretable; that is,
one cannot determine which concepts they represent in a given sense inventory. This
means that LSA cannot directly be used for tasks such as unsupervised sense
disambiguation or estimating semantic similarity of known concepts. LSA is
computationally expensive as singular value decomposition, a key component for
dimensionality reduction, requires computationally intensive matrix operations. This makes
LSA less scalable to large amounts of text (Gorman and Curran 2006). Finally, it too, like
other distributional word-distance measures, conflates the many senses of a word (see
Section 4.6.1 ahead for more discussion on sense conflation).

3.2.5 Co-occurrence Retrieval Models. The distributional measures suggested by 
Weeds (2003) are based on a notion of substitutability: the more appropriate it is to 



8 Landauer, Foltz, and Laham (1998) describe it as a measure of similarity, but in fact it is a distributional
measure that mimics semantic relatedness.

substitute word w1 in place of word w2 in a suitable natural language task, the more
semantically similar they are. She uses co-occurrence retrieval (the retrieval of words
that co-occur with a target word from text) to determine the degree to which one word
is substitutable by another. The degree of substitutability of w2 with w1 depends
on the proportion of co-occurrences of w1 that are also co-occurrences of w2 and
the proportion of co-occurrences of w2 that are also co-occurrences of w1. Thus Weeds's
distributional measures have a precision component and a recall component (which
may or may not incorporate the strength of co-occurrence association). The final score is
a weighted sum of the precision, recall, and standard F measure (see equation (19)).
The weights determine the importance of precision and recall and are determined
empirically. If precision and recall are equally important, then it results in a symmetric
measure which gives identical scores for the distributional similarity of w1 with w2 and
w2 with w1. Otherwise, we get an asymmetric measure which assigns different scores to
the two cases.



\[
\mathrm{CRM}(w_1, w_2) = \gamma \left[ \frac{2 \times P \times R}{P + R} \right] + (1 - \gamma) \left[ \beta [P] + (1 - \beta) [R] \right] \tag{19}
\]

γ and β are tuned parameters that lie between 0 and 1.

Both precision and recall can be considered as the product of a core formula (denoted
by core) and a penalty function (denoted by penalty). Weeds (2003) provides six (three
times two) distinct formulae for precision and recall, depending on the strength of
co-occurrence (three alternatives) and whether or not a penalty is applied for differences
in strength of association of common co-occurring words (two alternatives).

Depending on the strength of association, the CRMs are classified as type-based, 
token-based, and mutual information-based. The CRMs that use simple counts of the 
common co-occurrences and not the strength of associations as core precision and recall 
values are called type-based CRMs (denoted by the superscript type). The CRMs that 
use conditional probabilities of the shared co-occurring words with the target words 
are called token-based CRMs (denoted by the superscript token). The CRMs that use 
pointwise mutual information of the shared co-occurring words with target words 
are called mutual information-based CRMs (denoted by the superscript mi). The core 
precision and recall formulae for type, token, and mutual information-based CRMs are 
listed below: 



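\[
core_P^{type}(w_1, w_2) = \frac{|C(w_1) \cap C(w_2)|}{|C(w_1)|} \tag{20}
\]

\[
core_R^{type}(w_1, w_2) = \frac{|C(w_1) \cap C(w_2)|}{|C(w_2)|} \tag{21}
\]

\[
core_P^{token}(w_1, w_2) = \sum_{w \in C(w_1) \cap C(w_2)} P(w|w_1) \tag{22}
\]

\[
core_R^{token}(w_1, w_2) = \sum_{w \in C(w_1) \cap C(w_2)} P(w|w_2) \tag{23}
\]

\[
core_P^{mi}(w_1, w_2) = \frac{\sum_{w \in C(w_1) \cap C(w_2)} I(w, w_1)}{\sum_{w \in C(w_1)} I(w, w_1)} \tag{24}
\]

\[
core_R^{mi}(w_1, w_2) = \frac{\sum_{w \in C(w_1) \cap C(w_2)} I(w, w_2)}{\sum_{w \in C(w_2)} I(w, w_2)} \tag{25}
\]

where C(x) is the set of words that co-occur with x.

Note: P is short for P(w1, w2), while R is short for R(w1, w2). The abbreviations are made due to space
constraints and to improve readability.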

Depending on the penalty function, the CRMs are classified as additive and
difference-weighted. The CRMs that do not penalize difference in strength of
co-occurrence are called additive CRMs (denoted by the subscript add); those that do
penalize are called difference-weighted CRMs (subscript dw). The penalty is a conditional
probability-based function (26, 27) for the token- and type-based CRMs, and a mutual
information-based function (28, 29) for the mutual information-based CRM.



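\[
penalty_P^{type} = penalty_P^{token} = \frac{\min(P(w|w_1), P(w|w_2))}{P(w|w_1)} \tag{26}
\]

\[
penalty_R^{type} = penalty_R^{token} = \frac{\min(P(w|w_1), P(w|w_2))}{P(w|w_2)} \tag{27}
\]

\[
penalty_P^{mi} = \frac{\min(I(w, w_1), I(w, w_2))}{I(w, w_1)} \tag{28}
\]

\[
penalty_R^{mi} = \frac{\min(I(w, w_1), I(w, w_2))}{I(w, w_2)} \tag{29}
\]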

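The six pairs of precision and recall formulae (additive and difference-weighted, for
each of the type, token, and mutual information-based CRMs) are thus as follows:

\[
P_{add}^{type}(w_1, w_2) = \frac{|C(w_1) \cap C(w_2)|}{|C(w_1)|} \tag{30}
\]

\[
R_{add}^{type}(w_1, w_2) = \frac{|C(w_1) \cap C(w_2)|}{|C(w_2)|} \tag{31}
\]

\[
P_{dw}^{type}(w_1, w_2) = \frac{1}{|C(w_1)|} \sum_{w \in C(w_1) \cap C(w_2)} \frac{\min(P(w|w_1), P(w|w_2))}{P(w|w_1)} \tag{32}
\]

\[
R_{dw}^{type}(w_1, w_2) = \frac{1}{|C(w_2)|} \sum_{w \in C(w_1) \cap C(w_2)} \frac{\min(P(w|w_1), P(w|w_2))}{P(w|w_2)} \tag{33}
\]

\[
P_{add}^{token}(w_1, w_2) = \sum_{w \in C(w_1) \cap C(w_2)} P(w|w_1) \tag{34}
\]

\[
R_{add}^{token}(w_1, w_2) = \sum_{w \in C(w_1) \cap C(w_2)} P(w|w_2) \tag{35}
\]

\[
P_{dw}^{token}(w_1, w_2) = \sum_{w \in C(w_1) \cap C(w_2)} \min(P(w|w_1), P(w|w_2)) \tag{36}
\]

\[
R_{dw}^{token}(w_1, w_2) = \sum_{w \in C(w_1) \cap C(w_2)} \min(P(w|w_2), P(w|w_1)) \tag{37}
\]

\[
P_{add}^{mi}(w_1, w_2) = \frac{\sum_{w \in C(w_1) \cap C(w_2)} I(w, w_1)}{\sum_{w \in C(w_1)} I(w, w_1)} \tag{38}
\]

\[
R_{add}^{mi}(w_1, w_2) = \frac{\sum_{w \in C(w_1) \cap C(w_2)} I(w, w_2)}{\sum_{w \in C(w_2)} I(w, w_2)} \tag{39}
\]

\[
P_{dw}^{mi}(w_1, w_2) = \frac{\sum_{w \in C(w_1) \cap C(w_2)} \min(I(w, w_1), I(w, w_2))}{\sum_{w \in C(w_1)} I(w, w_1)} \tag{40}
\]

\[
R_{dw}^{mi}(w_1, w_2) = \frac{\sum_{w \in C(w_1) \cap C(w_2)} \min(I(w, w_1), I(w, w_2))}{\sum_{w \in C(w_2)} I(w, w_2)} \tag{41}
\]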



Note that in case of the difference-weighted token and mutual information-based pre- 
cision and recall formulae, there is a cancellation of a pair of terms obtained from the 
core formulae and the penalty. 
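
The following minimal sketch shows how precision and recall feed into the combined score of equation (19). It covers only the additive type-based and additive token-based variants; the function names, the parameter settings, and the toy co-occurrence data for dog and animal are hypothetical and are not taken from Weeds (2003).

```python
from typing import Dict, Set, Tuple


def type_precision_recall(c1: Set[str], c2: Set[str]) -> Tuple[float, float]:
    """Additive type-based CRM: proportion of each word's co-occurrence types that are shared."""
    shared = c1 & c2
    precision = len(shared) / len(c1) if c1 else 0.0
    recall = len(shared) / len(c2) if c2 else 0.0
    return precision, recall


def token_precision_recall(d1: Dict[str, float], d2: Dict[str, float]) -> Tuple[float, float]:
    """Additive token-based CRM: conditional probability mass on the shared co-occurrences."""
    shared = d1.keys() & d2.keys()
    return sum(d1[w] for w in shared), sum(d2[w] for w in shared)


def crm(precision: float, recall: float, gamma: float, beta: float) -> float:
    """Weighted combination of the F measure with precision and recall."""
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return gamma * f_measure + (1 - gamma) * (beta * precision + (1 - beta) * recall)


# Hypothetical co-occurrence sets, for illustration only.
DOG = {"bark", "eat", "run"}
ANIMAL = {"eat", "run", "hunt", "sleep", "graze", "fly"}

p, r = type_precision_recall(DOG, ANIMAL)
print(crm(p, r, gamma=0.0, beta=1.0))  # precision only
print(crm(r, p, gamma=0.0, beta=1.0))  # same setting with the two words swapped
print(crm(p, r, gamma=1.0, beta=0.5))  # F measure only, unchanged when the words are swapped
```

Swapping the two words exchanges precision and recall, so any setting that weighs them unequally gives an asymmetric score.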

Asymmetry in substitutability is intuitive, as in many cases it may be acceptable to 
substitute a word, say dog, with another, say animal, but the reverse is not acceptable as 
often. Since Weeds uses substitutability as a measure of semantic similarity, she believes 
that distributional similarity between two words should reflect this property as well. 
Hence, like the Kullback-Leibler divergence, all her distributional similarity models are
asymmetric. 

Weeds (2003) extracted verb-object pairs of 2,000 nouns from the British National
Corpus (BNC). The verbs related to the target words by the verb-object relation were
used. Thus each of the co-occurring verbs is related to the target nouns by the same
syntactic relation and therefore the measures mimic semantic similarity, not
relatedness. Correlation with human judgment (Miller and Charles word pairs) showed
that difference-weighted (r = 0.61) and additive mutual information-based measures
(r = 0.62) performed far better than the other CRMs.

4. The anatomy of a distributional measure 

Even though there are numerous distributional measures, many of which may seem 
dramatically different from each other, all distributional measures perform two func- 
tions: (1) create distributional profiles (DPs), and (2) calculate the distance between 
two DPs. 

The distributional profile of a word is the strength of association between it and each 
of the lexical, syntactic, and/or semantic units that co-occur with it. Commonly used 
measures of strength of association are conditional probability (0 to 1) and pointwise 
mutual information (−∞ to ∞). Commonly used units of co-occurrence with the target
are other words, and so we speak of the lexical distributional profile of a word (lexical
DPW). The co-occurring words may be all those in a predetermined window around
the target, or may be restricted to those that have a certain syntactic (e.g., verb-object)
or semantic (e.g., agent-theme) relation with the target word. We will refer to the former 
kind of DPs as relation-free. Usually in the latter case, separate association values are 
calculated for each of the different relations between the target and the co-occurring

Table 3

Measures of DP distance, measures of strength of association, and standard combinations.
Measures of strength of association that are traditionally used are marked with an asterisk (*).
The use of other measures of association remains to be explored.

Measures of DP distance                  Measures of strength of association

α-skew divergence (ASD)                  φ coefficient (Phi)
cosine (Cos)                             conditional probability (CP) *
Dice coefficient (Dice)                  cosine (Cos)
Euclidean distance (L2 norm)             Dice coefficient (Dice)
Hindle's measure (Hin)                   odds ratio (Odds)
Kullback-Leibler divergence (KLD)        pointwise mutual information (PMI) *
Manhattan distance (L1 norm)             Yule's coefficient (Yule)
Jensen-Shannon divergence (JSD)
Lin's measure (Lin)

Standard combinations

α-skew divergence with conditional probability (ASD-CP)
cosine with conditional probability (Cos-CP)
Dice coefficient with conditional probability (Dice-CP)
Euclidean distance with conditional probability (L2 norm-CP)
Hindle's measure with pointwise mutual information (Hin-PMI)
Kullback-Leibler divergence with conditional probability (KLD-CP)
Manhattan distance with conditional probability (L1 norm-CP)
Jensen-Shannon divergence with conditional probability (JSD-CP)
Lin's measure with pointwise mutual information (Lin-PMI)



units. We will refer to such DPs as relation-constrained. Typical relation-free DPs
are those of Schütze and Pedersen (1997) and Yoshida, Yukawa, and Kuwabara (2003).
Typical relation-constrained DPs are those of Lin (1998a) and Lee (2001). Below are
contrived, but plausible, examples of each for the word pulse; the numbers are conditional
probabilities:

relation-free DP

pulse: beat .28, racing .2, grow .13, beans .09, heart .04, . . .

relation-constrained DP

pulse: (beat, subject-verb) .34, (racing, noun-qualifying adjective) .22, (grow,
subject-verb) .14, . . .
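
As a rough illustration of how a relation-free DP might be built, the sketch below counts the words inside a fixed window around each occurrence of the target and converts the counts to conditional probabilities. The function name `relation_free_dp`, the window size, and the toy sentence are hypothetical; a relation-constrained DP would additionally need a parser or chunker to label each co-occurrence.

```python
from collections import Counter
from typing import Dict, List


def relation_free_dp(tokens: List[str], target: str, window: int = 2) -> Dict[str, float]:
    """Estimate P(w | target) from co-occurrences within +/- `window` tokens of the target."""
    counts: Counter = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}


# Hypothetical toy text, for illustration only.
TOKENS = "her pulse was racing and his pulse was steady".split()

print(relation_free_dp(TOKENS, "pulse", window=1))
```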

Since the DPs represent the contexts of the two target words, the distance between
the DPs is the distributional distance and, as per the distributional hypothesis, a proxy
for semantic distance. A measure of DP distance, such as cosine, calculates the distance
between two distributional profiles. While any of the measures of DP distance may be
used with any of the measures of strength of association, in practice only certain
combinations are used (see Table 3) and certain other combinations might not be meaningful
(for example, Kullback-Leibler divergence with φ coefficient). Observe from Table 3 that
all standard-combination distributional measures (or at least those that are described in
this paper) use either conditional probability or PMI as the measure of association.

4.1 Simple co-occurrences versus syntactically related words 

Harris (1968), one of the early proponents of the distributional hypothesis, used
syntactically related words to represent the context of a word. However, the strength
of association of any word appearing in the context of the target words may
be used to determine their distributional similarity. Dagan, Lee, and Pereira (1997),
Lee (1999), and Weeds (2003) represent the context of a noun with verbs whose object
it is (single syntactic relation), Hindle (1990) represents the context of a noun
with verbs with which it shares the verb-object or subject-verb relation, while
Lin (1998b) uses words related to a noun by any of the many pre-decided syntactic
relations to determine distributional similarity. Schütze and Pedersen (1997) and
Yoshida, Yukawa, and Kuwabara (2003) use all co-occurring words in a pre-decided
window size. Although Lin (1998b) shows that the use of multiple syntactic relations
is more beneficial as compared to just one, McCarthy et al. (2007) show that results
obtained using just word co-occurrences were almost as good as those
obtained using syntactically related words. Further, use of syntactically related words
entails the requirement of chunking or parsing the data.

4.2 Compositionality 

The various measures of distributional similarity may be divided into two kinds as per 
their composition. In certain measures, each co-occurring word contributes to some 
finite calculable distributional distance between the target words. The final score of 
distributional distance is the sum of these contributions. We will call such measures 
compositional measures. The relative entropy-based measures, $L_1$ norm and $L_2$ norm,
fall in this category. On the other hand, the cosine measure, along with Hindle's and 
Lin's mutual information-based measures, belong to the category of what we call 
non-compositional measures. Each co-occurring word shared by both target words 
contributes a score to the numerator and the denominator of the measures' formula. 
Words that co-occur with just one of the two target words contribute scores only to the 
denominator. The ratio is calculated once all co-occurring words are considered. Thus 
the distributional distance contributed by individual co-occurrences is not calculable 
and the final semantic distance cannot be broken down into compositional distances 
contributed by each of the co-occurrences. It is not clear as to which of the two kinds 
of measures (compositional or non-compositional) resembles human judgment more 
closely and how much they differ in their ranking of word pairs. 

4.2.1 Primary Compositional Measures. The compositional measures of distributional
similarity (or relatedness) capture the contribution to distance between the target words
($w_1$ and $w_2$) due to a co-occurring word by three primary mathematical manipulations
of the co-occurrence distributions ($d_1$ and $d_2$): the difference, denoted by Dif (as in
$L_1$ norm), division, denoted by Div (as in the relative entropy-based measures), and
product, denoted by Pdt (as in the conditional probability or mutual information-based
cosine method). We will call the three types of compositional measures primary
compositional measures (PCM). Their form is depicted below:

$$\mathit{Dif} = \sum_{w \in C(w_1) \cup C(w_2)} \left| P(w|w_1) - P(w|w_2) \right| \tag{42}$$

$$\mathit{Div} = \sum_{w \in C(w_1) \cup C(w_2)} \left| \log \frac{P(w|w_1)}{P(w|w_2)} \right| \tag{43}$$

$$\mathit{Pdt} = \sum_{w \in C(w_1) \cup C(w_2)} \frac{P(w|w_1) \times P(w|w_2)}{\text{Scaling Factor}} \tag{44}$$



Observe that by taking absolute values in expressions (42) and (43), the variation in the
distributions for different co-occurring words has an additive effect rather than one of
cancellation. This corresponds to our distributional hypothesis: the greater the disparity
in distributions, the greater the semantic distance between the target words. The
product form (44) also achieves this and is based on this theorem: the product of any two
numbers is always less than or equal to the square of their average. In other words,
the closer two numbers are to each other in value, the higher is the ratio of their
product to a suitable scaling factor (for example, the square of their average). Note that
the difference and division measures give higher values when there is large disparity
between the strengths of association of co-occurring words with the target words. They
are therefore measures of distributional distance and not distributional similarity. The
product method gives higher values when the strengths of association are closer, and is
a measure of distributional relatedness.

Although all three methods seem intuitive, each produces different distributional
similarity values and, more importantly, given a set of word pairs, each is likely to rank
them differently. For example, consider the division and difference expressions applied
to word pairs ($w_1$, $w_2$) and ($w_3$, $w_4$). For simplicity, let there be just one word $w'$ in the
context of all the words. Given:

$$P(w'|w_1) = 0.91 \qquad P(w'|w_2) = 0.80 \qquad P(w'|w_3) = 0.60 \qquad P(w'|w_4) = 0.50$$

The distributional distance between word pairs as per the difference PCM:

$$\mathit{Dif}(w_1, w_2) = |0.91 - 0.80| = 0.11$$

$$\mathit{Dif}(w_3, w_4) = |0.60 - 0.50| = 0.10$$

The distributional distance between word pairs as per the division PCM:

$$\mathit{Div}(w_1, w_2) = \left| \log_{10} \frac{0.91}{0.80} \right| = 0.056$$

$$\mathit{Div}(w_3, w_4) = \left| \log_{10} \frac{0.60}{0.50} \right| = 0.079$$



Observe that for the same set of co-occurrence probabilities, the difference-based
measure ranks the ($w_3$, $w_4$) pair as more distributionally similar (lower distributional distance),
while the division-based measure ranks the ($w_1$, $w_2$) pair as more similar; that is, it gives
lower distributional distance values for word pairs having large co-occurrence probabilities.
This behavior is not intuitive and it remains to be seen, by experimentation, which of the
three (difference, division, or product) yields distributional similarity measures closest
to human notions of semantic similarity.
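The worked example can be checked with a short Python sketch. The function names `dif` and `div` and the dictionary encoding of the profiles are assumptions made for illustration; base-10 logarithms are used because they reproduce the values 0.056 and 0.079 quoted above.

```python
import math

def dif(dp1, dp2):
    """Difference PCM: sum of absolute differences over the union of contexts."""
    words = set(dp1) | set(dp2)
    return sum(abs(dp1.get(w, 0.0) - dp2.get(w, 0.0)) for w in words)

def div(dp1, dp2):
    """Division PCM: sum of absolute log-ratios (base 10).

    Assumes every context word has a nonzero probability in both
    profiles, as in the single-word example of the text.
    """
    words = set(dp1) | set(dp2)
    return sum(abs(math.log10(dp1[w] / dp2[w])) for w in words)

# One shared context word w' for all four target words.
w1, w2 = {"w'": 0.91}, {"w'": 0.80}
w3, w4 = {"w'": 0.60}, {"w'": 0.50}

dif_12, dif_34 = dif(w1, w2), dif(w3, w4)   # about 0.11 and 0.10
div_12, div_34 = div(w1, w2), div(w3, w4)   # about 0.056 and 0.079
```

The two measures order the pairs differently: `dif_12 > dif_34` while `div_12 < div_34`.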

The $L_1$ norm is a basic implementation of the difference method. A simple
product-based measure of distributional similarity is proposed below:

$$\mathit{Pdt}_{Avg}(w_1, w_2) = \sum_{w \in C(w_1) \cup C(w_2)} \frac{P(w|w_1) \times P(w|w_2)}{\left(\frac{1}{2}\left(P(w|w_1) + P(w|w_2)\right)\right)^2} \tag{45}$$

The scaling factor used is the square of the average probability. It can be proved
that if the sum of two variables is equal to a constant ($k$, say), their values must
be equal to $k/2$ in order to get the largest product. Now, let $k$ be equal to the sum
of $P(w|w_1)/(P(w|w_1) + P(w|w_2))$ and $P(w|w_2)/(P(w|w_1) + P(w|w_2))$. This sum will
always be equal to 1 and hence the product ($Z$) of the two fractions will be largest only when the two
numbers are equal, i.e., when $P(w|w_1)$ is equal to $P(w|w_2)$. In other words, the farther $P(w|w_1)$
and $P(w|w_2)$ are from their average, the smaller is the product $Z$. Therefore, the measure
gives high scores for low disparity in strengths of co-occurrence and low scores
otherwise. The incorporation of 1/2 in the scaling factor results in a measure in which the
contribution of each co-occurring word ranges between 0 and 1.
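A minimal Python sketch of this product-based measure follows, assuming profiles are dictionaries of conditional probabilities. The function name `pdt_avg` and the convention of skipping words whose average probability is zero are assumptions of the sketch.

```python
def pdt_avg(dp1, dp2):
    """Product PCM scaled by the square of the average probability.

    Each co-occurring word contributes a value between 0 and 1:
    1 when the two probabilities are equal, 0 when the word
    co-occurs with only one of the two target words.
    """
    words = set(dp1) | set(dp2)
    total = 0.0
    for w in words:
        p1, p2 = dp1.get(w, 0.0), dp2.get(w, 0.0)
        avg = 0.5 * (p1 + p2)
        if avg > 0.0:  # guard against division by zero (assumed convention)
            total += (p1 * p2) / (avg * avg)
    return total

equal_profiles = pdt_avg({"a": 0.4}, {"a": 0.4})       # single word, equal strengths
disjoint_profiles = pdt_avg({"a": 0.4}, {"b": 0.4})    # no shared context word
unequal_profiles = pdt_avg({"a": 0.9}, {"a": 0.1})     # shared word, large disparity
```

Equal strengths give the maximum per-word contribution, and the contribution shrinks as the disparity grows.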

The relative entropy-based methods use a weighted division method. Observe that
both Kullback-Leibler divergence (formula repeated below for convenience as equation
(46)) and Jensen-Shannon divergence do not take absolute values of the division of
co-occurrence probabilities. This means that if $P(w|w_1) > P(w|w_2)$, the logarithm of
their ratio will be positive, and if $P(w|w_1) < P(w|w_2)$, the logarithm will be
negative. Therefore, there will be a cancellation of contributions to distributional distance
by words that have higher co-occurrence probability with respect to $w_1$ and words that
have a higher co-occurrence probability with respect to $w_2$. Observe, however, that the
weight $P(w|w_1)$ multiplying the logarithm means that in general the positive logarithm
values receive higher weight than the negative ones, resulting in a net positive score.
Therefore, with no absolute value of the logarithm, as in the KLD, the weight plays
a crucial role. A modified Kullback-Leibler divergence ($D^{Abs}$) which incorporates the
absolute value is suggested in equation (47) (see footnote 10).

$$\mathit{KLD}(w_1, w_2) = D(d_1 \| d_2) = \sum_{w \in C(w_1) \cup C(w_2)} P(w|w_1) \log \frac{P(w|w_1)}{P(w|w_2)} \tag{46}$$

$$\mathit{KLD}^{Abs}(w_1, w_2) = D^{Abs}(d_1 \| d_2) = \sum_{w \in C(w_1) \cup C(w_2)} P(w|w_1) \left| \log \frac{P(w|w_1)}{P(w|w_2)} \right| \tag{47}$$

10 It should be noted that any change to the formula for Kullback-Leibler divergence means that the
resulting measure is no longer Kullback-Leibler divergence; these measures are denoted by KLD (with a
suitable subscript and/or superscript) simply to indicate that they are derived from the Kullback-Leibler
divergence.



The updated Jensen-Shannon divergence measure will remain the same as in its original
formulation, except that it is a manipulation of $D^{Abs}$ and not the original Kullback-Leibler
divergence (relative entropy).

$$\mathit{JSD}^{Abs}(w_1, w_2) = D^{Abs}\left(d_1 \,\Big\|\, \tfrac{1}{2}(d_1 + d_2)\right) + D^{Abs}\left(d_2 \,\Big\|\, \tfrac{1}{2}(d_1 + d_2)\right) \tag{48}$$

Note that once the absolute value of the logarithm is taken, it no longer makes much
sense to use an asymmetric weight ($P(w|w_1)$) as in the KLD, nor is it necessary to use a
weight at all. Equation (49) shows a simple division-based measure. It is an unweighted
form of $\mathit{KLD}^{Abs}(w_1, w_2)$ and so we will call it $\mathit{KLD}^{Abs}_{UnW}$.

$$\mathit{KLD}^{Abs}_{UnW}(w_1, w_2) = \mathit{Div}(w_1, w_2) = \sum_{w \in C(w_1) \cup C(w_2)} \left| \log \frac{P(w|w_1)}{P(w|w_2)} \right| \tag{49}$$



Experimental evaluation of these suggested modifications of Kullback-Leibler diver- 
gence will be interesting. 
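The three variants can be compared side by side with a small Python sketch. It assumes both profiles are defined over the same context words with nonzero (for example, smoothed) probabilities; the function names and the toy profiles are hypothetical.

```python
import math

def _check_same_support(dp1, dp2):
    # The sketch assumes smoothed profiles over an identical set of words.
    if set(dp1) != set(dp2):
        raise ValueError("profiles must cover the same context words")

def kld(dp1, dp2):
    """Kullback-Leibler divergence, as in equation (46)."""
    _check_same_support(dp1, dp2)
    return sum(dp1[w] * math.log(dp1[w] / dp2[w]) for w in dp1)

def kld_abs(dp1, dp2):
    """Weighted division with the absolute value of the logarithm, as in (47)."""
    _check_same_support(dp1, dp2)
    return sum(dp1[w] * abs(math.log(dp1[w] / dp2[w])) for w in dp1)

def kld_abs_unw(dp1, dp2):
    """Unweighted division with the absolute value of the logarithm, as in (49)."""
    _check_same_support(dp1, dp2)
    return sum(abs(math.log(dp1[w] / dp2[w])) for w in dp1)

# Toy profiles invented for illustration.
d1 = {"a": 0.7, "b": 0.3}
d2 = {"a": 0.4, "b": 0.6}

plain = kld(d1, d2)            # positive and negative terms partly cancel
absolute = kld_abs(d1, d2)     # no cancellation, still asymmetric
unweighted = kld_abs_unw(d1, d2)  # no cancellation, symmetric
```

In this toy case the plain divergence is smaller than the absolute-value variant because the term for "b" is negative, and only the unweighted variant returns the same value when the arguments are swapped.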

4.2.2 Weighting the PCMs. The performance of the primary compositional measures 
may be improved by adding suitable weights to the distributional distance contributed 
by each co-occurrence. The idea is that some co-occurrences may be better indicators 
of semantic distance than others. Usually, a formulation of the strength of association 
of the co-occurring word with the target words is used as weight, the hypothesis being 
that a strong co-occurrence is likely to be a strong indicator of semantic closeness. 

Weighting the primary compositional measures results in some of the existing
measures. For example, as pointed out earlier, the Kullback-Leibler divergence is a weighted
form of the division measure (not considering the absolute value). Here, the conditional
probability of a co-occurring word with respect to the first word ($P(w|w_1)$) is used as the
weight. Since the absolute value of the logarithm is not taken and because the weight
($P(w|w_1)$) is dependent on the first word and not the other, Kullback-Leibler divergence
is asymmetric. Below is a symmetric weight function:

$$\mathit{weight}_{AvgWt}(w_1, w_2) = \frac{1}{2}\left(P(w|w_1) + P(w|w_2)\right) \tag{50}$$

$L_2$ norm is a weighted version of the $L_1$ norm, the weight being $P(w|w_1) - P(w|w_2)$. A
simple product measure with weights is shown below:

$$\mathit{Pdt}_{AvgWt}(w_1, w_2) = \sum_{w \in C(w_1) \cup C(w_2)} \frac{1}{2}\left(P(w|w_1) + P(w|w_2)\right) \times \frac{P(w|w_1) \times P(w|w_2)}{\left(\frac{1}{2}\left(P(w|w_1) + P(w|w_2)\right)\right)^2} = \sum_{w \in C(w_1) \cup C(w_2)} \frac{P(w|w_1) \times P(w|w_2)}{\frac{1}{2}\left(P(w|w_1) + P(w|w_2)\right)} \tag{51}$$

A possibly better weight function (which is also symmetric) hinges on the following
hypothesis: The stronger the association of a co-occurring word with a target word, the
better the indicator of semantic properties of the target word it is. Equation (52) shows
the corresponding weight function:

$$\mathit{weight}_{MaxWt}(w_1, w_2) = \max\left(P(w|w_1), P(w|w_2)\right) \tag{52}$$



The co-occurring word is likely to have different strengths of association
with the two target words. Taking the maximum of the two as the weight
(Dagan, Marcus, and Markovitch 1995) will mean that more weight is given to a
co-occurring word if it has high strength of association with either of the two target words.
As Dagan, Marcus, and Markovitch (1995) point out, there is strong evidence for
dissimilarity if the strength of association with the other target word is much lower than
the maximum, and strong evidence of similarity if the strength of association with both
target words is more or less the same.
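The effect of the two symmetric weight functions can be sketched in Python. Applying them to the difference PCM is an illustrative choice made here, not a measure proposed in the text, and the names `weight_avg`, `weight_max`, and `weighted_dif` are hypothetical.

```python
def weight_avg(p1, p2):
    """Average of the two strengths of association, as in equation (50)."""
    return 0.5 * (p1 + p2)

def weight_max(p1, p2):
    """Maximum of the two strengths of association, as in equation (52)."""
    return max(p1, p2)

def weighted_dif(dp1, dp2, weight):
    """Difference PCM with each word's contribution scaled by a weight.

    Combining these weights with Dif is an assumption of this sketch.
    """
    total = 0.0
    for w in set(dp1) | set(dp2):
        p1, p2 = dp1.get(w, 0.0), dp2.get(w, 0.0)
        total += weight(p1, p2) * abs(p1 - p2)
    return total

# Toy profiles invented for illustration.
dp1 = {"a": 0.6, "b": 0.4}
dp2 = {"a": 0.2, "c": 0.8}

with_avg = weighted_dif(dp1, dp2, weight_avg)
with_max = weighted_dif(dp1, dp2, weight_max)
```

Both weights are symmetric in the two target words, so swapping the profiles leaves either score unchanged; the maximum weight gives more influence to words that are strongly associated with only one target.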

4.3 Predictors of Semantic Relatedness 

Given a pair of target words, the vocabulary may be divided into three sets: (1) the set
of words that co-occur with both target words (common); (2) words that co-occur with
exactly one of the two target words (exclusive); (3) words that do not co-occur with
either of the two target words. Hindle (1990) uses evidence only from words that
co-occur with both target words to determine the distributional similarity. All the other
measures discussed in this paper so far also use words that co-occur with just one target
word.

One can argue that the more there are common co-occurrences between two words, 
the more they are related. For example, drink and sip may be considered related as they 
have a number of common co-occurrences such as water, tea and so on. Similarly, drink 
and chess can be deemed unrelated as words that co-occur with one do not with the 
other. For example, water and tea do not usually co-occur with chess, while castle and 
move are not found close to drink. Measures that use all co-occurrences (common and
exclusive) tap into this intuitive notion. However, certain strong exclusive co-occurrences
can adversely affect the measure. Consider the classic strong coffee vs. powerful coffee
example (Halliday 1966). The words strong and powerful are semantically very related.
However, the word coffee is likely to co-occur with strong but not with powerful. Further,
strong and coffee can be expected to have a large value of association as given by a
suitable measure, say PMI. This large PMI value, if used in the distributional relatedness
formula, can greatly reduce the final value. Thus it is not clear whether the benefit of using all
co-occurrences is outweighed by the drawback pointed out.

A further advantage of using only common co-occurrences is that the Kullback- 
Leibler divergence can now be used without the need of smoothed probabilities. 



$$\mathit{KLD}_{com}(w_1, w_2) = \sum_{w \in C(w_1) \cap C(w_2)} P(w|w_1) \log \frac{P(w|w_1)}{P(w|w_2)} \tag{53}$$

Observe that we are taking the intersection of the sets of co-occurring words instead of the
union as in the original formula (repeated earlier as equation (46)).

4.4 Capitalizing on asymmetry

Given a hypernym-hyponym pair (automobile-car, say), asymmetric distributional
measures such as the Kullback-Leibler divergence, α-skew divergence, and the CRMs
generate different values for the distributional distance of $w_1$ with $w_2$ as compared
to that of $w_2$ with $w_1$. Usually, if $w_1$ is a more generic concept than $w_2$, the
measures find $w_1$ to be more distributionally similar to $w_2$ than the other way round (see
Mirkin, Dagan, and Geffet (2007) for work on lexical entailment using the Kullback-Leibler
divergence). Weeds (2003) argues that this behavior is intuitive as it is more often
acceptable to substitute a generic concept in place of a specific one than vice versa, and
substitutability is an indicator of semantic similarity.

On the other hand, in most cases the notion of asymmetric semantic similarity is
counterintuitive, and possibly detrimental. In many natural language tasks, one needs
the distance between two words and there is no order information. Further, in case two
words share a hypernym-hyponym relation, they are likely to be highly semantically
similar. Thus given two words, it may make sense to always choose the higher of the
two distributional similarity values suggested by an asymmetric measure as the final
distributional similarity between the two. This way an asymmetric measure ($\mathit{Sim}_{Asym}$)
can easily be converted into a symmetric one ($\mathit{Sim}_{Max}$), while still capitalizing on the
asymmetry to generate more suitable distributional distance values for
hypernym-hyponym word pairs. Equation (54) states the formula for the proposed conversion.
A specific implementation on the KL divergence formula is given in equation (55):

$$\mathit{Sim}_{Max}(w_1, w_2) = \max\left(\mathit{Sim}_{Asym}(w_1, w_2), \mathit{Sim}_{Asym}(w_2, w_1)\right) \tag{54}$$

$$\mathit{KLD}_{Max}(w_1, w_2) = \max\left(\mathit{KLD}(w_1, w_2), \mathit{KLD}(w_2, w_1)\right) \tag{55}$$

Another way to convert an asymmetric measure of distributional distance into a
symmetric one is by taking the average (formula (56)) of the two possible similarity values.
A specific implementation on the KL divergence formula is given in equations (57)
through (60):

$$\mathit{Sim}_{Avg}(w_1, w_2) = \frac{1}{2}\left(\mathit{Sim}_{Asym}(w_1, w_2) + \mathit{Sim}_{Asym}(w_2, w_1)\right) \tag{56}$$

$$\mathit{KLD}_{Avg}(w_1, w_2) = \frac{1}{2}\left(\mathit{KLD}(w_1, w_2) + \mathit{KLD}(w_2, w_1)\right) \tag{57}$$

$$= \frac{1}{2} \sum_{w \in C(w_1) \cup C(w_2)} \left( P(w|w_1) \log \frac{P(w|w_1)}{P(w|w_2)} + P(w|w_2) \log \frac{P(w|w_2)}{P(w|w_1)} \right) \tag{58}$$

$$= \frac{1}{2} \sum_{w \in C(w_1) \cup C(w_2)} \left( P(w|w_1) \log \frac{P(w|w_1)}{P(w|w_2)} - P(w|w_2) \log \frac{P(w|w_1)}{P(w|w_2)} \right) \tag{59}$$

$$= \frac{1}{2} \sum_{w \in C(w_1) \cup C(w_2)} \left( P(w|w_1) - P(w|w_2) \right) \log \frac{P(w|w_1)}{P(w|w_2)} \tag{60}$$
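The maximum and average conversions can be sketched in a few lines of Python. As an assumption of this sketch, the asymmetric base measure is the intersection-based divergence of equation (53), chosen only so that the example runs without smoothed probabilities; the function names and toy profiles are hypothetical.

```python
import math

def kld_com(dp1, dp2):
    """Kullback-Leibler style sum restricted to common co-occurrences (53)."""
    common = set(dp1) & set(dp2)
    return sum(dp1[w] * math.log(dp1[w] / dp2[w]) for w in common)

def sym_max(measure, dp1, dp2):
    """Symmetric conversion by taking the maximum of both directions (54)."""
    return max(measure(dp1, dp2), measure(dp2, dp1))

def sym_avg(measure, dp1, dp2):
    """Symmetric conversion by averaging both directions (56)."""
    return 0.5 * (measure(dp1, dp2) + measure(dp2, dp1))

# Toy profiles invented for illustration; "a" and "b" are the common words.
dp1 = {"a": 0.5, "b": 0.3, "c": 0.2}
dp2 = {"a": 0.2, "b": 0.3, "d": 0.5}

forward = kld_com(dp1, dp2)
backward = kld_com(dp2, dp1)
maximum = sym_max(kld_com, dp1, dp2)
average = sym_avg(kld_com, dp1, dp2)

# Closed form of the averaged divergence over the same word set.
closed_form = 0.5 * sum(
    (dp1[w] - dp2[w]) * math.log(dp1[w] / dp2[w]) for w in set(dp1) & set(dp2)
)
```

The two directions differ, which is the asymmetry discussed above, while `maximum` and `average` are unchanged when the profiles are swapped. Note that a sum restricted to common words is not guaranteed to be non-negative, since the restricted probabilities no longer sum to 1.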

4.5 Summarizing the distributional measures 

Table 4 summarizes the properties of various distributional measures discussed in this
paper.

Table 4: Distributional measures and their properties.

measure              type        compositional   PCM          symmetric   strength of association
ASD                  distance    yes             division     no          CP
cos                  closeness   no              n.a.         yes         CP
CRMs                 closeness   no              n.a.         no          both
Dice                 closeness   no              n.a.         yes         CP
Dif or $L_1$         distance    yes             difference   yes         CP
Div                  distance    yes             division     yes         CP
Hindle               closeness   no              n.a.         yes         PMI
Jaccard              closeness   no              n.a.         yes         CP
JSD                  distance    yes             division     yes         CP
KLD                  distance    yes             division     no          CP
$KLD^{Abs}$          distance    yes             division     no          CP
$KLD_{com}$          distance    yes             division     no          CP
$KLD_{Avg}$          distance    yes             division     yes         CP
$KLD_{Max}$          distance    yes             division     yes         CP
$L_2$                distance    yes             difference   yes         CP
Lin                  closeness   no              n.a.         yes         PMI
$Pdt_{Avg}$          closeness   yes             product      yes         CP
$Pdt_{AvgWt}$        closeness   yes             product      yes         CP

Note: For measures that are not compositional, the type of PCM is not applicable.

Formulas:

ASD: $\sum_{w \in C(w_1) \cup C(w_2)} P(w|w_1) \log \frac{P(w|w_1)}{\alpha P(w|w_2) + (1 - \alpha) P(w|w_1)}$

cos: $\frac{\sum_{w \in C(w_1) \cup C(w_2)} \left(P(w|w_1) \times P(w|w_2)\right)}{\sqrt{\sum_{w \in C(w_1)} P(w|w_1)^2} \times \sqrt{\sum_{w \in C(w_2)} P(w|w_2)^2}}$

CRMs: $\gamma \left[\frac{2 \times P \times R}{P + R}\right] + (1 - \gamma)\left[\beta [P] + (1 - \beta)[R]\right]$

Dice: $\frac{2 \times \sum_{w \in C(w_1) \cap C(w_2)} \min\left(P(w|w_1), P(w|w_2)\right)}{\sum_{w \in C(w_1)} P(w|w_1) + \sum_{w \in C(w_2)} P(w|w_2)}$

Dif or $L_1$: $\sum_{w \in C(w_1) \cup C(w_2)} \left| P(w|w_1) - P(w|w_2) \right|$

Div: $\sum_{w \in C(w_1) \cup C(w_2)} \left| \log \frac{P(w|w_1)}{P(w|w_2)} \right|$

Hindle: $\sum_{w \in C(w)} h(w)$, where $h(w) = \min(I(w, w_1), I(w, w_2))$ if $I(w, w_1) > 0$ and $I(w, w_2) > 0$; $h(w) = \left| \max(I(w, w_1), I(w, w_2)) \right|$ if $I(w, w_1) < 0$ and $I(w, w_2) < 0$; and $h(w) = 0$ otherwise

Jaccard: $\frac{\sum_{w \in C(w_1) \cap C(w_2)} \min\left(P(w|w_1), P(w|w_2)\right)}{\sum_{w \in C(w_1) \cup C(w_2)} \max\left(P(w|w_1), P(w|w_2)\right)}$

JSD: $\sum_{w \in C(w_1) \cup C(w_2)} \left( P(w|w_1) \log \frac{P(w|w_1)}{\frac{1}{2}(P(w|w_1) + P(w|w_2))} + P(w|w_2) \log \frac{P(w|w_2)}{\frac{1}{2}(P(w|w_1) + P(w|w_2))} \right)$

KLD: $\sum_{w \in C(w_1) \cup C(w_2)} P(w|w_1) \log \frac{P(w|w_1)}{P(w|w_2)}$

$KLD^{Abs}$: $\sum_{w \in C(w_1) \cup C(w_2)} P(w|w_1) \left| \log \frac{P(w|w_1)}{P(w|w_2)} \right|$

$KLD_{com}$: $\sum_{w \in C(w_1) \cap C(w_2)} P(w|w_1) \log \frac{P(w|w_1)}{P(w|w_2)}$

$KLD_{Avg}$: $\frac{1}{2} \sum_{w \in C(w_1) \cup C(w_2)} \left(P(w|w_1) - P(w|w_2)\right) \log \frac{P(w|w_1)}{P(w|w_2)}$

$KLD_{Max}$: $\max\left(\mathit{KLD}(w_1, w_2), \mathit{KLD}(w_2, w_1)\right)$

$L_2$: $\sqrt{\sum_{w \in C(w_1) \cup C(w_2)} \left(P(w|w_1) - P(w|w_2)\right)^2}$

Lin: $\frac{\sum_{(r,w) \in T(w_1) \cap T(w_2)} \left(I(w_1, r, w) + I(w_2, r, w)\right)}{\sum_{(r,w') \in T(w_1)} I(w_1, r, w') + \sum_{(r,w'') \in T(w_2)} I(w_2, r, w'')}$

$Pdt_{Avg}$: $\sum_{w \in C(w_1) \cup C(w_2)} \frac{P(w|w_1) \times P(w|w_2)}{\left(\frac{1}{2}(P(w|w_1) + P(w|w_2))\right)^2}$

$Pdt_{AvgWt}$: $\sum_{w \in C(w_1) \cup C(w_2)} \frac{P(w|w_1) \times P(w|w_2)}{\frac{1}{2}(P(w|w_1) + P(w|w_2))}$

4.6 Challenges

4.6.1 Conflation of word senses. The distributional hypothesis (Firth 1957) states that
words that occur in similar contexts tend to be semantically close. But when words have
more than one sense, it is not at all clear what semantic distance between them actually
means. Further, a word in each of its senses is likely to co-occur with different sets of
words. For example, bank in the FINANCIAL INSTITUTION sense is likely to co-occur with
interest, money, accounts, and so on, whereas the RIVER BANK sense might have words
such as river, erosion, and silt around it. Since words that occur together in text tend
to refer to senses that are closest in meaning to one another, in most natural language
applications, what is needed is the distance between the closest senses of the two target
words. However, because distributional measures calculate distance from all occurrences
of the target word, and hence from all its senses, they fail to get the
desired result. Also note that the dimensionality reduction inherent to latent semantic
analysis, a special kind of distributional measure, has the effect of making the
predominant senses of the words more dominant while de-emphasizing the other senses.
Therefore, an LSA-based approach will also conflate information from the different
senses, and even more emphasis will be placed on the predominant senses. Given the
semantically close target nouns play and actor, for example, a distributional measure will
give a score that is some sort of a dominance-based average of the distances between
their senses. The noun play has the predominant sense of CHILDREN'S RECREATION
(and not DRAMA), so a distributional measure will tend to give the target pair a large
(and thus erroneous) distance score. WordNet-based measures do not suffer from this
problem as they give distance between concepts, not words.

4.6.2 Lack of explicitly-encoded world knowledge and data sparseness. It is becoming 
increasingly clear that more-accurate results can be achieved in a large number of 
natural language tasks, including the estimation of semantic distance, by combining 
corpus statistics with a knowledge source, such as a dictionary, published thesaurus, or 
WordNet. This is because such knowledge sources capture semantic information about 
concepts and, to some extent, world knowledge. For example, WordNet, as discussed 
earlier, has an extensive is-a hierarchy. If it lists one concept, say GERMAN SHEPHERD,
as a hyponym of another, say DOG, then we can be sure that the two are semantically 
close. On the other hand, distributional measures do not have access to such explicitly 
encoded information. Further, unless the corpus used by a distributional measure has 
sufficient instances of GERMAN SHEPHERD and DOG, it will be unable to deem them 
semantically close. Since Zipf's law seems to hold even for the largest of corpora, 
there will always be words that occur too few times to accurately determine their 
distributional distance from others. 

4.6.3 Limitations shared with WordNet-based measures. In addition to the limitations
described above, which are unique to the knowledge-lean distributional measures, these
measures share two problems with the knowledge-rich measures: they require the calculation
of large distance matrices (as described in Section 2.3.4 earlier) and they are reluctant to
cross the language barrier (Section 2.3.5).

5. A hybrid approach: Distributional measures of concept-distance 

So far we have looked at approaches that exploit the structure of a resource such as
WordNet, and corpus-based distributional approaches that make use of co-occurrence
statistics. A hybrid approach to semantic distance is one that reconciles the two,
combining the information about concepts, explicitly encoded in a linguistic
resource, with the information about words, implicitly encoded in text by co-occurrence.
Mohammad and Hirst (2006b) and Mohammad et al. (2007) have proposed one such
approach that combines corpus statistics with a published thesaurus to give the
semantic distance between concepts (rather than words).

5.1 The distributional hypothesis for concepts 

As pointed out in Section 4.6.1, words when used in different senses tend to keep
different "company" (co-occurring words). For example, consider the contrived but
plausible distributional profile of star:

star: space 0.21, movie 0.16, famous 0.15, light 0.12, constellation 0.11, heat 0.08, rich 0.07, 
hydrogen 0.07, . . . 

Observe that it has words that co-occur both with star's CELESTIAL BODY sense and star's
CELEBRITY sense. Thus, it is clear that different senses of a word may have very different
distributional profiles. Using a single DP for the word will mean the union of those
profiles. While this might be useful for certain applications, Mohammad and Hirst (2006b)
argue that in a number of tasks (including estimating semantic distance), acquiring
different DPs for the different senses is not only more intuitive, but also, as they show
through experiments, more useful. They show that the closer the distributional profiles
of two concepts, the smaller is their semantic distance. Below are example distributional
profiles of two senses of star:

CELESTIAL BODY: space 0.36, light 0.17, constellation 0.11, hydrogen 0.07, . . . 
CELEBRITY: famous 0.24, movie 0.14, rich 0.14, fan 0.10, . . .

The values are the strength of association (usually pointwise mutual information or 
conditional probability) of the target concept with co-occurring words. 

We have seen that creating distributional profiles of words involves simple
word-word co-occurrence counts. The creation of distributional profiles of concepts (DPCs),
on the other hand, requires: (1) a concept inventory that lists all the concepts and words
that refer to them, and (2) counts of how often a concept co-occurs with a word in text.
These two aspects will be discussed in the next two sub-sections; however, once created,
any of the many distributional measures can be used to estimate the distance between
the DPs of two target concepts (just as in the case of traditional word-distance measures,
where distributional measures are used to estimate the distance between the DPs of two
target words). For example, here is how Mohammad and Hirst adapt the formula for cosine
(described earlier in Section 3.2.1) to estimate distributional distance between two concepts:

$$\mathit{Cos}_{cp}(c_1, c_2) = \frac{\sum_{w \in C(c_1) \cup C(c_2)} \left(P(w|c_1) \times P(w|c_2)\right)}{\sqrt{\sum_{w \in C(c_1)} P(w|c_1)^2} \times \sqrt{\sum_{w \in C(c_2)} P(w|c_2)^2}} \tag{61}$$

$C(x)$ is now the set of words that co-occur with concept $x$ within a pre-determined
window. The conditional probabilities in the formula are taken from the distributional
profiles of concepts.

5.1.1 A suitable knowledge source and concept inventory.
Mohammad and Hirst (2006b) use the categories in the Macquarie Thesaurus, 812
in all, as very coarse-grained word senses or concepts, in contrast to approaches that
use WordNet or other similarly fine-grained sense inventories (see footnote 11). Their approach to
determining word-concept co-occurrence counts (described in the next sub-section)
requires a set of possibly ambiguous words that together unambiguously represent
each concept, for which a thesaurus is a natural choice. The use of categories in a
thesaurus as concepts means that this approach requires a concept-concept distance
matrix of size only 812 × 812, much smaller than (less than 0.01% of) the matrix
required by the WordNet-based and distributional measures.



5.1.2 Estimating distributional profiles of concepts. A word-category co-occurrence
matrix (WCCM) is created having word types $w$ as one dimension and thesaurus
categories $c$ as another.

$$\begin{array}{c|ccccc}
 & c_1 & c_2 & \ldots & c_j & \ldots \\ \hline
w_1 & m_{11} & m_{12} & \ldots & m_{1j} & \ldots \\
w_2 & m_{21} & m_{22} & \ldots & m_{2j} & \ldots \\
\vdots & \vdots & \vdots & & \vdots & \\
w_i & m_{i1} & m_{i2} & \ldots & m_{ij} & \ldots \\
\vdots & \vdots & \vdots & & \vdots &
\end{array}$$

The matrix is populated with co-occurrence counts from a large corpus. A particular cell
$m_{ij}$, corresponding to word $w_i$ and category or concept $c_j$, is populated with the number
of times $w_i$ co-occurs (they use a window of ±5 words) with any word that has $c_j$ as one
of its senses (i.e., $w_i$ co-occurs with any word listed under concept $c_j$ in the thesaurus).
This matrix, created after a first pass of the corpus, is called the base word-category
co-occurrence matrix (base WCCM). A contingency table for any particular word $w$
and category $c$ can be easily generated from the WCCM by collapsing cells for all other
words and categories into one and summing up their frequencies.

$$\begin{array}{c|cc}
 & c & \neg c \\ \hline
w & n_{wc} & n_{w\neg} \\
\neg w & n_{\neg c} & n_{\neg\neg}
\end{array}$$

A suitable statistic, such as pointwise mutual information or conditional probability,
will then yield the strength of association between the word and the category.
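The first pass can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the corpus is a flat list of tokens, the thesaurus is a dictionary from words to sets of category names, and the names `build_base_wccm` and `conditional_probability` are hypothetical. The toy corpus uses a window of ±1 to keep the counts easy to follow; the text describes a window of ±5.

```python
from collections import defaultdict

def build_base_wccm(tokens, thesaurus, window=5):
    """Count how often each word co-occurs with words listed under each category.

    tokens: list of word tokens (assumed pre-tokenized corpus).
    thesaurus: dict mapping a word to the set of categories listing it.
    """
    wccm = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # Every candidate category of the neighbour is credited,
            # since the text is not sense-annotated.
            for category in thesaurus.get(tokens[j], ()):
                wccm[word][category] += 1
    return wccm

def conditional_probability(wccm, word, category):
    """Estimate P(word | category) from the column total of the matrix."""
    column_total = sum(row.get(category, 0) for row in wccm.values())
    if column_total == 0:
        return 0.0
    return wccm.get(word, {}).get(category, 0) / column_total

# Toy corpus and thesaurus invented for illustration.
tokens = ["star", "light", "space", "movie", "star"]
thesaurus = {
    "star": {"CELESTIAL BODY", "CELEBRITY"},
    "space": {"CELESTIAL BODY"},
}
wccm = build_base_wccm(tokens, thesaurus, window=1)
p_light = conditional_probability(wccm, "light", "CELESTIAL BODY")
```

Because the ambiguous word star credits both of its categories, the base matrix is noisy; the counts for a word's true category are nevertheless expected to dominate once a large corpus is used.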

As the base WCCM is created from unannotated text, it will be noisy but will
nonetheless capture strong word-category co-occurrence associations reasonably accurately.
This is because the errors in determining the true category that a word co-occurs
with will be distributed thinly across a number of other categories. (For more
discussion of the general principle see Resnik (1998).) A second pass of the corpus is made
and the base WCCM is used to disambiguate the words in it. A new bootstrapped
WCCM is created such that each cell $m_{ij}$, corresponding to word $w_i$ and concept $c_j$,
is populated with the number of times $w_i$ co-occurs with any word used in sense $c_j$.
Mohammad and Hirst (2006a) showed that this WCCM, created after simple sense
disambiguation, better captures word-concept co-occurrence values, and hence strengths
of association values, than the base WCCM.

11 It has been suggested for some time now that WordNet is much too fine-grained for certain natural
language applications (Agirre and Lopez de Lacalle Lekuona (2003) and citations therein).



5.1.3 Mimicking semantic relatedness and semantic similarity. The distributional 
profiles created by the above methodology are relation-free. This is because (1) all 
co-occurring words (not just those that are related to the target by certain syntactic 
relations) are used, and (2) the WCCM, as described above, does not maintain separate 
counts for the different syntactic relations between the target and co-occurring words. 
Thus, distributional measures that use these WCCMs will estimate semantic related- 
ness between concepts. Distributional measures that mimic semantic similarity, which 
require relation-constrained DPCs, can easily be created from WCCMs that have rows 
for each word-syntactic relation pair (rather than just for words). 



5.1.4 Performance. Mohammad and Hirst (2006b) evaluate this approach on two tasks:
ranking word pairs in order of their semantic distance and correcting real-word spelling
errors. On both tasks, distributional concept-distance measures markedly outperformed
distributional word-distance measures. The WordNet-based measures performed better
in the word-pair ranking task, but in the spelling correction task three of the four
distributional measures outperformed all WordNet-based measures except the
Jiang-Conrath measure. It should be noted, however, that these experiments evaluated only
semantic similarity of noun-noun pairs; for all other part-of-speech combinations and
for semantic relatedness estimates, the WordNet-based measures are markedly less
accurate.



5.2 Multilinguality 

Some of the best algorithms for semantic distance cannot be applied to most languages
because of a lack of high-quality linguistic resources. Mohammad et al. (2007) showed
how text in one language $L_1$ can be combined with a knowledge source in another $L_2$
using a bilingual lexicon $L_1$-$L_2$ and a bootstrapping and concept-disambiguation
algorithm to create cross-lingual distributional profiles of concepts. These cross-lingual
DPCs model co-occurrence distributions of concepts, as per a knowledge source in
one language, with words from another language, and obtain semantic distance in a
resource-poor language using a knowledge source from a resource-rich one.
Cross-lingual semantic distance and cross-lingual DPCs are also useful in tasks that inherently
involve two or more languages, such as machine translation, multilingual
multidocument tasks of clustering, coreference resolution, and information retrieval. We
summarize their approach here using German as $L_1$ and English as $L_2$; however, the algorithm
is language-pair independent.



5.2.1 Cross-lingual senses, cross-lingual distributional profiles, and cross-lingual
distributional distance. Given a German word $w^{de}$ in context, Mohammad et al. (2007)
use a German-English bilingual lexicon to determine its different possible English
translations. Each English translation $w^{en}$ may have one or more possible coarse senses,
as listed in an English thesaurus. These English thesaurus concepts ($c^{en}$) will be referred
to as the cross-lingual candidate senses of the German word $w^{de}$. Figure 1 depicts
examples.

[Figure 1 diagram: the German words ($w^{de}$) Stern and Bank link to their English
translations ($w^{en}$) star, bank, and bench, which in turn link to the English thesaurus
concepts ($c^{en}$) CELESTIAL BODY, CELEBRITY, RIVER BANK, FINANCIAL INSTITUTION,
FURNITURE, and JUDICIARY.]

Figure 1

The cross-lingual candidate senses of German words Stern and Bank.
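The lookup of cross-lingual candidate senses can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two dictionaries are toy stand-ins for a bilingual lexicon and an English thesaurus, the function name `candidate_senses` is hypothetical, and the grouping of concepts under each English word is an assumption read off Figure 1.

```python
# Toy, hypothetical resources standing in for a German-English bilingual
# lexicon and an English thesaurus. The concept groupings are assumed
# from the Stern/Bank example in Figure 1.
BILINGUAL_LEXICON = {
    "Stern": ["star"],
    "Bank": ["bank", "bench"],
}
THESAURUS = {
    "star": ["CELESTIAL BODY", "CELEBRITY"],
    "bank": ["RIVER BANK", "FINANCIAL INSTITUTION"],
    "bench": ["FURNITURE", "JUDICIARY"],
}


def candidate_senses(german_word, lexicon=BILINGUAL_LEXICON, thesaurus=THESAURUS):
    """Return the cross-lingual candidate senses of a German word.

    Every thesaurus concept of every English translation is collected,
    in order and without duplicates. No attempt is made to filter out
    senses that are only an artifact of the translation step.
    """
    senses = []
    for english_word in lexicon.get(german_word, []):
        for concept in thesaurus.get(english_word, []):
            if concept not in senses:
                senses.append(concept)
    return senses
```

Under these toy resources, `candidate_senses("Bank")` includes RIVER BANK and JUDICIARY even though the German word Bank has neither meaning, which is exactly why the senses are called candidates.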

As in the monolingual estimation of distributional concept-distance, the distance 
between two concepts is calculated by first determining their DPs. However, a concept 
is now glossed by near-synonymous words in an English thesaurus, whereas its profile 
is made up of the strengths of association with co-occurring German words. Here are 
constructed example cross-lingual distributional profiles of the two cross-lingual
candidate senses of the German word Stern:

CELESTIAL BODY (celestial body, sun, ...): Raum 0.36, Licht 0.27, Konstellation 0.11, ...
CELEBRITY (celebrity, hero, ...): berühmt 0.24, Film 0.14, reich 0.14, ...

The cross-lingual DPCs are created from a cross-lingual word-category co-occurrence 
matrix without the use of any word-aligned parallel corpora or sense-annotated data 
(as described in the next subsection). Just as in the case of monolingual distributional 
concept-distance measures, distributional measures can be used to estimate the distance 
between the cross-lingual DPs of two target concepts. For example, the cosine formula 
can be adapted to estimate cross-lingual distributional distance between two concepts 
as shown below: 

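\[
\mathrm{Cos}(c_1^{en}, c_2^{en}) =
\frac{\sum_{w^{de} \in C(c_1^{en}) \cup C(c_2^{en})} P(w^{de} \mid c_1^{en}) \times P(w^{de} \mid c_2^{en})}
{\sqrt{\sum_{w^{de} \in C(c_1^{en})} P(w^{de} \mid c_1^{en})^2} \times \sqrt{\sum_{w^{de} \in C(c_2^{en})} P(w^{de} \mid c_2^{en})^2}}
\]

C(x) is now the set of German words that co-occur with English concept x within a
pre-determined window. The conditional probabilities in the formula are taken from the
cross-lingual DPCs.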
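As a rough illustration of the adapted cosine formula, the following Python sketch treats each cross-lingual DPC as a dictionary from German words to conditional probabilities. The function name `cross_lingual_cosine` is hypothetical, and the two profiles contain only the entries listed in the constructed example above; the elided entries are not filled in, so the resulting value is illustrative only.

```python
import math


def cross_lingual_cosine(dp1, dp2):
    """Cosine between two cross-lingual distributional profiles.

    dp1 and dp2 map German words to P(word | concept). Words missing
    from a profile contribute a probability of zero.
    """
    words = set(dp1) | set(dp2)  # C(c1) union C(c2)
    numerator = sum(dp1.get(w, 0.0) * dp2.get(w, 0.0) for w in words)
    norm1 = math.sqrt(sum(p * p for p in dp1.values()))
    norm2 = math.sqrt(sum(p * p for p in dp2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an empty profile shares nothing with any other
    return numerator / (norm1 * norm2)


# Truncated profiles taken from the constructed example in the text.
CELESTIAL_BODY = {"Raum": 0.36, "Licht": 0.27, "Konstellation": 0.11}
CELEBRITY = {"berühmt": 0.24, "Film": 0.14, "reich": 0.14}

similarity = cross_lingual_cosine(CELESTIAL_BODY, CELEBRITY)
```

Because the listed entries of the two profiles share no German words, the cosine here is zero; with full profiles the overlap, and hence the value, would generally be larger.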

12 They are called "candidate senses" because some of the senses of $w^{en}$ might not really be senses of $w^{de}$.
For example, CELESTIAL BODY and CELEBRITY are both senses of the English word star, but the German
word Stern can only mean CELESTIAL BODY and not CELEBRITY. Similarly, the German Bank can mean
FINANCIAL INSTITUTION or FURNITURE, but not RIVER BANK or JUDICIARY. An automated system has no
straightforward method of teasing out the actual cross-lingual senses of $w^{de}$ from those that are an artifact
of the translation step.

13 Vocabulary of German words needed to understand this discussion: Bank: 1. financial institution, 2.
bench (furniture); berühmt: famous; Film: movie (motion picture); Himmelskörper: heavenly body;
Konstellation: constellation; Licht: light; Morgensonne: morning sun; Raum: space; reich: rich; Sonne:
sun; Star: star (celebrity); Stern: star (celestial body).

[Figure 2 diagram: the concept CELESTIAL BODY links to the English words celestial body, sun,
and star, which in turn link to the German words Himmelskörper, Sonne, Morgensonne, Star, and Stern.]

Figure 2

Words having CELESTIAL BODY as one of their cross-lingual candidate senses.

5.2.2 Creating cross-lingual word-category co-occurrence matrix. A German-English
cross-lingual word-category co-occurrence matrix has German word types $w^{de}$ as one
dimension and English thesaurus concepts $c^{en}$ as another.
The matrix is populated with co-occurrence counts from a large German corpus. A
particular cell $m_{ij}$, corresponding to word $w_i^{de}$ and concept $c_j^{en}$, is populated with the
number of times the German word $w_i^{de}$ co-occurs (say, within a window of ±5 words) with
any German word having $c_j^{en}$ as one of its cross-lingual candidate senses. For example,
the Raum-CELESTIAL BODY cell will have the sum of the number of times Raum
co-occurs with Himmelskörper, Sonne, Morgensonne, Star, Stern, and so on (see Figure 2).
This matrix, created after a first pass of the corpus, is called the cross-lingual base
WCCM. A contingency table for any particular German word $w^{de}$ and English category
$c^{en}$ can be easily generated from the WCCM by collapsing cells for all other words
and categories into one and summing up their frequencies. A suitable statistic, such 
as PMI or conditional probability, will yield the strength of association between the 
German word and the English category. Then a new bootstrapped cross-lingual WCCM 
is created, just as in the monolingual case. 
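The first pass and the contingency-table step can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the corpus is assumed to be pre-tokenized into sentences, windows are assumed not to cross sentence boundaries, and the candidate senses are supplied as a plain dictionary. The bootstrapping pass is not shown.

```python
from collections import defaultdict


def build_base_wccm(sentences, senses_of, window=5):
    """First pass: count word-concept co-occurrences.

    sentences: list of token lists (German text).
    senses_of: dict mapping a German word to its candidate concepts.
    Cell [word][concept] counts how often `word` occurs within
    +/- `window` tokens of any word having `concept` as a candidate sense.
    """
    wccm = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j == i:
                    continue
                for concept in senses_of.get(tokens[j], []):
                    wccm[word][concept] += 1
    return wccm


def contingency_table(wccm, word, concept):
    """Collapse all other words and concepts into single cells.

    Returns counts for (word, concept), (word, other concepts),
    (other words, concept), (other words, other concepts).
    """
    n_wc = wccm.get(word, {}).get(concept, 0)
    n_w = sum(wccm.get(word, {}).values())
    n_c = sum(row.get(concept, 0) for row in wccm.values())
    n = sum(sum(row.values()) for row in wccm.values())
    return (n_wc, n_w - n_wc, n_c - n_wc, n - n_w - n_c + n_wc)


def conditional_probability(wccm, word, concept):
    """Estimate P(word | concept) from the contingency table."""
    n_wc, _, n_other_c, _ = contingency_table(wccm, word, concept)
    total = n_wc + n_other_c
    return n_wc / total if total else 0.0
```

In this sketch a word such as Raum receives a count for every candidate sense of a neighbouring Stern, including CELEBRITY; the base matrix is therefore noisy, which is the motivation for the subsequent bootstrapped WCCM.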



5.2.3 Performance. Mohammad et al. (2007) evaluated the cross-lingual measures of
semantic distance on two tasks: (1) estimating semantic distance between words and
ranking the word pairs according to semantic distance, and (2) solving Reader's Digest
'Word Power' problems. They compared these results with those obtained by conventional
state-of-the-art monolingual approaches with and without a knowledge source
in the target language L1 (GermaNet). The cross-lingual approach obtained much better
results than monolingual approaches that do not use a knowledge source. Further, in
both tasks, the cross-lingual measures performed as well as, if not slightly better than, the
GermaNet-based monolingual approaches. This shows that the cross-lingual
approach is able to keep losses due to the translation step to a minimum, while allowing
the use of a superior knowledge source in another language to get better results.

5.3 Challenges 

Distributional measures of concept-distance have many desirable features of both 
knowledge-rich approaches and strictly corpus-based approaches — they have the high 
accuracies of knowledge-rich approaches, they can measure both semantic relatedness
and semantic similarity, and they rely strongly on corpora, which makes them domain
adaptable. Further, with the cross-lingual approach, the lack of a high-quality knowledge
source in the target language is no longer a problem. However, certain issues remain.

5.3.1 Integrating domain-specific terminology. The reliance on a knowledge source
means that the approach cannot measure the distance between words not listed in the
thesaurus. This is especially a problem for domain-specific jargon, which might not find
a place in a general-purpose knowledge source. Automatic ways of integrating
domain-specific terminology into a general-purpose knowledge source will be valuable to this
end.

5.3.2 Choosing the right concept-granularity. Mohammad and Hirst (2006b) and
Mohammad et al. (2007) have reported results using the categories of the thesaurus as
very coarse word senses. This level of granularity has worked well for the tasks they
experimented with; however, a relatively finer sense inventory may be more suitable for
other tasks. Words within a thesaurus category are grouped into paragraphs. Using
these paragraphs (instead of categories) as senses, and determining when this
finer-grained sense inventory is more suitable, remains to be explored.

5.3.3 Identifying lexical semantic relations. Word pairs can be semantically close 
because of any of the classical lexical semantic relations, such as hypernymy,
near-synonymy, antonymy, troponymy, and meronymy, or the innumerable non-classical
relations. The various distributional approaches discussed in this paper determine
semantic distance without explicitly identifying the nature of the relationship. Already,
there is some work on determining lexical entailment (Mirkin, Dagan, and Geffet 2006)
and determining near-synonymy (Lin et al. 2003). Identifying antonymy (or more generally,
contrasting word-pairs) is especially useful in many natural language tasks, even
if it is simply to discard them from a list of distributionally close terms. Also, it will
be interesting for measures of semantic distance to characterize the nature of any non- 
classical relationship shared by two words — not only determining if two terms are close 
but also specifying (in some intuitive way) the aspect of meaning they share. 

6. Conclusion 

A large number of important natural language problems, including machine transla- 
tion, information retrieval, and word sense disambiguation, can be viewed in part as 
semantic distance problems. Numerous measures of semantic distance exist — those that 
use a knowledge source and those that rely on corpora. Yet, their use in real-world 
applications has been limited. In this paper, we investigated how automatic measures 
can be brought more in line with human notions of semantic distance, how they can be 
made applicable to a large number of natural language tasks, and how they can be used 
even for languages deficient in high-quality linguistic resources. 

Even though corpus-based distributional measures of distance have traditionally
performed poorly when compared to WordNet-based measures, we have shown
(1) that a number of properties make distributional measures uniquely attractive,
and (2) that their potential is yet to be fully realized. Distributional measures can be
easily applied to most languages (they can make do even with just raw text) and 
they can be used to mimic both semantic similarity and semantic relatedness. With 
this in mind, the paper presented a detailed study of many important distributional 
measures, analyzed their limitations, and explained why their performance has been
relatively poor so far. Understanding these limitations is crucial in the development 
of new and better approaches, whether they have a distributional base or otherwise. 
We concluded the paper with a discussion of a hybrid, yet distinctly distributional,
approach that presents one way to measure distributional distance more accurately
without compromising too much on essential properties such as applicability to
resource-poor languages.